Transcript

00:00:00.000 leading to his first paper, which is affiliated with the Department of Genetics here at UGA.

00:00:15.100 The Greene Lab develops machine learning methods to integrate disparate large-scale datasets,

00:00:21.300 develops deep learning methods for extracting context from these datasets, and then brings

00:00:26.020 these capabilities to molecular biologists through open and transparent science.

00:00:32.140 Of particular note to folks interested in large language models, like ChatGPT, Dr.

00:00:38.580 Greene and others recently posted a preprint on how AI might be able to help us write and

00:00:44.260 revise our academic manuscripts.

00:00:47.420 Please join me in welcoming Dr. Casey Greene to UGA.

00:00:54.420 Okay, thank you, yeah, it’s good to be back.

00:01:04.700 I actually spent a summer working in this building in the Department of Genetics.

00:01:09.380 Everything around this building looks different now, but somehow the building still looks

00:01:11.860 the same.

00:01:14.500 I’m excited to get a chance to share some of what we’ve been doing in our group and

00:01:19.980 then to share some of what’s going on at the University of Colorado and sort of give you

00:01:23.140 an idea of what the ecosystem is like where we are.

00:01:27.980 I think it’s always important to think about kind of what our role as informaticists in

00:01:34.180 science is, like what do we bring to the ecosystem, and I like to think of what we contribute

00:01:40.140 as essentially serendipity, right?

00:01:42.900 If we do our jobs well, people will see something in their data that they hadn’t seen before,

00:01:46.580 and they will make a different decision based on that.

00:01:49.020 So I like to, you know, think about how we can make more kind of serendipitous moments.

00:01:55.540 I’ll start with a brief vignette of a project that we started quite a few years ago now

00:02:02.940 trying to understand rare diseases.

00:02:07.860 And in particular, what systemic factors might drive

00:02:13.500 a rare disease.

00:02:16.340 The world when we started this project was one where looking across multiple data sets

00:02:22.260 remained somewhat challenging.

00:02:24.460 It’s time consuming, you have to deal with batch and technical artifacts.

00:02:28.980 And so our question when we started, a postdoc joined the lab and the idea was, well, you

00:02:33.380 know, here’s the gap that we’re facing.

00:02:35.180 If you want to analyze multiple data sets at the same time, to identify systemic factors,

00:02:40.060 that means you’re looking at different tissues, probably looking at different cohorts, there’s

00:02:44.180 potentially different disease contexts, might be different controls.

00:02:48.060 All of this makes your life a lot harder, you can’t just kind of make an assumption

00:02:50.900 that you’re going to do three different t tests and be done with it.

00:02:55.100 And so what this postdoc wanted to do was say, okay, can I find these commonalities

00:02:59.980 without it taking an inordinately large amount of time.

00:03:03.180 And she was very interested in not taking an inordinately large amount of time because

00:03:07.780 in her PhD, before she joined the group, she had developed

00:03:14.300 this approach using a modular framework to analyze these data sets, where you essentially

00:03:17.740 take different data sets, decompose them into modules, and then try to map a module in one

00:03:23.220 data set to a module in another data set to a module in another data set.

00:03:27.100 This is possible, if you’re an expert in the disease, and you’re an expert in all the tissues

00:03:31.340 that sort of are affected, you can do this, you can say, okay, this is this pathway response

00:03:36.340 in this tissue, and this is this other pathway response in this other tissue, but it’s really

00:03:39.740 time consuming.

00:03:40.740 And the complexity grows essentially, with at least the square of the number of modules

00:03:47.460 that you want to look at.

00:03:48.460 So you sort of restrict yourself to a modest number of modules if you want to do this in

00:03:51.380 any practical amount of time.

00:03:53.580 There are potentially ways to automate this, you know, using over representation analysis

00:03:57.060 or other strategies to try to, you know, make life easier on the mapping stage, so you don’t

00:04:00.940 spend a lot of time looking at stuff that’s unlikely to be fruitful.

00:04:04.220 But either way, it’s challenging, it takes a bunch of time and it requires a lot of expertise.

00:04:10.100 So what Dr. Jaclyn Taroni did was say, well, what I really want is a module library,

00:04:15.980 like how could this process or cell type be represented in any data set that I look at,

00:04:21.220 I’d like to be able to pull that off the shelf, and then take that module library to different

00:04:26.100 data sets, and then look at each of those data sets in terms of those modules and just

00:04:30.420 look directly across those modules.

00:04:31.580 I don’t have to now sort of do the module connection after the fact, I can do it upfront.

00:04:37.500 Wouldn’t it be great if we could do that?

00:04:41.460 And so her hypothesis was that, you know, these modules don’t just exist in one data

00:04:46.220 set, they exist across human biology.

00:04:48.180 So her hypothesis was that she would be able to learn these reusable modules

00:04:54.540 by taking many, many, many different data sets, decomposing those different data sets

00:04:58.700 into modules, and then sort of learning which modules were necessary to reconstruct the

00:05:04.260 original data.

00:05:06.180 And then once she did that, she could do that on a generic collection of data, and then

00:05:10.180 hopefully be able to use that on her data sets of interest.

00:05:12.860 And in this case, we’re interested in studying a disease called ANCA-associated vasculitis,

00:05:17.580 which is rare enough that if you look at large collections of public RNA-seq data, you don’t

00:05:22.060 find it.

00:05:23.680 So it’s quite a rare disease.

00:05:28.500 So the idea here was, if she took a whole bunch of generic samples, essentially random

00:05:31.700 human data that she downloaded from the internet, transcriptomic data,

00:05:35.900 she could decompose that into patterns, and then she could take those patterns and we

00:05:39.620 could apply those to the rare disease data sets of interest, and potentially do sort

00:05:42.980 of standard statistics with that.

00:05:45.940 Just to give you an idea of the data set that we started with, this is a data set called

00:05:52.500 recount2, and I think there’s now a recount3.

00:05:56.100 This is produced by Jeff Leek’s group at Hopkins.

00:05:59.220 And if you wanted 70,000 RNA-seq samples, you can just get 70,000 uniformly processed

00:06:04.660 RNA-seq samples.

00:06:05.660 What I like to tell Jackie is that her project had to be successful, because if you think

00:06:11.140 about the resources we were giving her to start it, if you just benchmark that those

00:06:14.260 samples probably cost about $1,000 to generate, you know, I like to tell her, look, we gave

00:06:18.780 you $70 million to start your project, you have to do something with $70 million, right?

00:06:23.460 So, this was the data that we used, and then we tried quite a few different methods to

00:06:29.300 extract patterns, including some that we developed in our group, but we ended up coming to this

00:06:34.300 method called PLIER, which is from Maria Chikina’s group at Pitt.

00:06:39.260 And PLIER is the Pathway-Level Information ExtractoR, and it does a couple things that

00:06:44.340 are really nice.

00:06:45.340 So, it’s essentially doing this matrix decomposition, but as you do the matrix decomposition, there’s

00:06:50.540 a couple sort of regularization factors and some penalties in it, and essentially, it

00:06:55.580 has some sparsity properties that we really like, so the idea is that you want to be able

00:06:59.420 to explain a dataset with relatively few latent variables.

00:07:02.740 Also, you want your latent variables to have only a modest number of genes in them.

00:07:07.300 And finally, if those latent variables can align with a pathway, you’d really prefer

00:07:12.740 that the latent variable align with a pathway, because anytime you’re doing this decomposition,

00:07:16.220 you’re essentially doing some arbitrary rotation, right?

00:07:19.180 PCA is essentially learning a rotation of your data, a reduced

00:07:25.380 dimension space rotation of your data.

00:07:28.380 The rotation in PCA is essentially arbitrary, I mean, you’ve chosen that you want to maximize

00:07:33.340 the variance on the first axis, but you could have chosen anything else, like ICA, you’re

00:07:37.740 just like, I don’t care, just make it small.

00:07:41.020 So in this case, what it’s doing is it’s essentially saying, if a pathway can line up with an axis,

00:07:46.100 let’s do that.

00:07:47.260 That’s the intuition for what it’s doing; the actual

00:07:52.620 regularization is a little bit different than that.

00:07:56.700 And what that gives you is a really nice level of interpretability.

00:07:59.340 So instead of saying, okay, this cell

00:08:03.700 type is these three different axes of variability added in some fraction, usually those cell

00:08:08.980 types are going to come out as a single axis in your data.

00:08:11.060 So it makes it much easier to think through and reason about the solutions.
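The penalized decomposition being described can be sketched roughly like this (toy matrices and made-up penalty weights; the actual PLIER objective and optimizer differ in detail):

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_samples, n_lvs, n_pathways = 50, 30, 5, 8

Y = rng.standard_normal((n_genes, n_samples))                # expression (genes x samples)
C = (rng.random((n_genes, n_pathways)) < 0.2).astype(float)  # prior pathway membership (genes x pathways)

Z = rng.random((n_genes, n_lvs))             # loadings: genes x latent variables
B = rng.standard_normal((n_lvs, n_samples))  # latent variable activity per sample
U = rng.random((n_pathways, n_lvs))          # pathway-to-LV alignment (want this sparse)

l1, l2, l3 = 1.0, 1.0, 0.1  # made-up weights for illustration
reconstruction = np.linalg.norm(Y - Z @ B) ** 2       # explain the data with few LVs
pathway_pull = l1 * np.linalg.norm(Z - C @ U) ** 2    # prefer loadings that line up with pathways
ridge = l2 * np.linalg.norm(B) ** 2                   # keep LV activities modest
sparsity = l3 * np.abs(U).sum()                       # each LV should align with few pathways

loss = reconstruction + pathway_pull + ridge + sparsity
```

Minimizing a loss of this shape is what nudges an axis toward a known pathway when one fits, rather than leaving the rotation arbitrary.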

00:08:18.100 And so what we essentially did is say, okay, well, we have this enormous collection of

00:08:22.260 generic human data from the internet, recount2, and we have PLIER, and we’d like that to

00:08:26.940 be a machine learning model that we could use in many different biological contexts.

00:08:30.380 So we named it multi-dataset PLIER, but because in bioinformatics, everything has to have

00:08:34.900 a name, we just shortened that to MultiPLIER.

00:08:38.780 And so this is kind of the MultiPLIER idea.

00:08:41.380 I’m just going to give you a couple highlight results from the paper that I think help to

00:08:47.540 kind of give an idea of why you might want to do this.

00:08:49.820 The paper is pretty exhaustive.

00:08:51.500 It has a significant deep dive into sort of exactly what’s happening in this stuff, but

00:08:55.340 I’ll just give you kind of the high points.

00:08:59.580 Essentially what we wanted to understand is, does this model learn something we didn’t

00:09:03.660 already know?

00:09:05.740 And is it better than if you just took data from the disease of interest?

00:09:10.180 We couldn’t get enough data from the disease of interest, which is ANCA-associated vasculitis.

00:09:13.700 So we did that analysis in the paper, and we just didn’t learn many axes of variability

00:09:19.540 because there’s too little data.

00:09:21.020 So what we wanted to do is say, well, let’s pick something where we could actually learn

00:09:24.460 something.

00:09:25.460 Let’s sort of give this idea a chance.

00:09:27.220 And then we said, well, let’s imagine we’re studying a different autoimmune disease, lupus.

00:09:31.220 So that’s what’s here, where it says SLE, this box plot.

00:09:35.660 What we’ve done is we’ve collected all the whole blood data that we could get from individuals

00:09:39.940 with lupus that was publicly available to create one collection of data.

00:09:45.260 Then what we’ve done with recount2, which is the generic human data from the internet,

00:09:48.780 is we’ve taken recount2 and we’ve subsampled it to be the same size as the lupus set.

00:09:53.820 So that’s the box plot.

00:09:54.940 The box plot is recount2 subsampled.

00:09:57.420 And then where you see the diamond, that’s what happens if you just take the complete

00:10:00.460 collection of data from recount2.

00:10:02.460 Okay, so let’s label our X.

00:10:04.300 Oh, so yeah, so these data sets, if you think about the science behind this, this experiment

00:10:08.220 has these two data sets of the same size, but different composition.

00:10:12.300 These two are the same composition of the data, but they’re quite different sizes.

00:10:16.140 I think 70 times larger for the diamond than the other.

00:10:20.780 And so then we can ask, okay, so how many patterns are we learning from our data?

00:10:23.580 This method uses latent variable decomposition, so we count the number of latent variables, which are

00:10:27.660 essentially patterns.

00:10:29.340 There’s a heuristic

00:10:32.860 in PLIER for sort of selecting the optimal number of latent variables, and

00:10:37.140 we used that heuristic here; it seems like a pretty reasonable heuristic.

00:10:40.260 It uses cross-validation to essentially ask how frequently you’re rediscovering the same

00:10:45.140 latent variables.
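A crude stand-in for that cross-validation idea, using plain SVD on toy data rather than PLIER (hypothetical; it just counts how often a component reappears across random sample splits):

```python
import numpy as np

def reproducibility(X, k, n_splits=10, thresh=0.8, seed=0):
    """Split samples in half, decompose each half (here just SVD), and count
    how often each component reappears in the other half (matched by |cosine|)."""
    rng = np.random.default_rng(seed)
    hits = np.zeros(k)
    half = X.shape[1] // 2
    for _ in range(n_splits):
        idx = rng.permutation(X.shape[1])
        A = np.linalg.svd(X[:, idx[:half]], full_matrices=False)[0][:, :k]
        B = np.linalg.svd(X[:, idx[half:2 * half]], full_matrices=False)[0][:, :k]
        sim = np.abs(A.T @ B)               # columns are unit norm, so this is |cosine|
        hits += sim.max(axis=1) > thresh    # best match in the other half
    return hits / n_splits                  # fraction of splits each component recurred in

# Toy data with three strong recurring patterns plus noise
rng = np.random.default_rng(3)
signal = rng.standard_normal((100, 3)) @ rng.standard_normal((3, 60))
X = 3.0 * signal + rng.standard_normal((100, 60))

rates = reproducibility(X, k=5)
```

Components that reflect real structure come back split after split; components that are fitting noise do not, which is the intuition behind picking the number of latent variables this way.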

00:10:47.860 So if you do that, you find that you can learn more latent variables or recurrent patterns

00:10:52.900 in the data if you have less heterogeneity in your data given a fixed sample size.

00:10:56.900 I don’t think this is going to shock anyone, right?

00:11:00.420 If I give you a limited amount of data, you would like that data to be as consistent as

00:11:03.900 possible except for the thing you’d like to vary, right?

00:11:06.620 That’s usually a good place to live.

00:11:10.060 So that’s what we see here, right?

00:11:11.340 You get more latent variables out of the lupus data than the subsampled recount2 data, but

00:11:15.740 if I told you, well, you can have really messy data, but you can have a lot of it, now you

00:11:19.460 can learn a lot more patterns.

00:11:21.080 So you end up with many fold more patterns that you can learn, the kind of recurrent

00:11:25.500 patterns across the data set if you have more data.

00:11:27.620 So you’d rather have less heterogeneity, but sometimes having more samples can overcome

00:11:32.780 the heterogeneity issue.

00:11:37.300 So that’s the first thing we ask.

00:11:38.300 So you get kind of more total patterns.

00:11:40.060 The next thing we can ask is, okay, well, we don’t know what complete collection of

00:11:47.440 processes are transcriptionally co-regulated, right?

00:11:50.060 This is not something we know a priori.

00:11:51.460 We can get a collection of processes if we go to Gene Ontology or KEGG or other databases,

00:11:55.940 but some of those may not be transcriptionally co-regulated.

00:11:58.060 However, if we’re seeing a process that’s coming out as transcriptionally co-regulated,

00:12:02.620 that’s probably a positive hit.

00:12:04.860 And so that’s kind of the assumption we’re making here.

00:12:06.700 So this is looking at both the SLE and the recount2 data again.

00:12:10.820 This axis is what fraction of the pathways that we know about are coming back as sort

00:12:16.580 of aligned with one or more axes in the data.

00:12:20.580 And so you can see this actually wasn’t driven by composition of the data, this was driven

00:12:24.660 by sample size.

00:12:25.660 So if you put in more data, you can learn kind of more transcriptional co-regulation.

00:12:30.580 This obscures a little bit of what’s going on here because the processes are a little

00:12:34.060 bit different.

00:12:36.460 If you look at what happens in the recount2 data, you end up learning, as you get to the

00:12:41.740 large sample size, you end up learning more granular processes.

00:12:45.020 So it seems like it’s probably that you’re covering the same things that are transcriptionally

00:12:48.780 co-regulated, you just get a higher level of resolution.

00:12:52.740 And then, so you kind of learn more of the stuff we know we should know about.

00:12:56.780 So at this point, anything I’m showing you, you could also do with GSEA or any of these

00:13:00.580 other methods, gene set enrichment analysis, or those types of methods.

00:13:03.820 But we can also ask, can we learn anything we didn’t already know going in?

00:13:07.620 So this is asking what fraction of the latent variables that are coming back are not associated

00:13:14.620 with a pathway.

00:13:17.100 And what you can see is, when you have the sort of more modest sample sizes, about half

00:13:23.140 of the latent variables that come back are associated with a pathway, and the other

00:13:25.700 half are potentially novel.

00:13:27.580 Could be novel biology, it could be a technical artifact.

00:13:31.460 What you see when you get the large collection of recount2 data is, you know, that number

00:13:35.420 drops to about 20%.

00:13:36.420 So 80% of the latent variables that are coming back didn’t exist in the databases that we

00:13:40.300 had available to us going into it.

00:13:42.100 So that’s a really nice thing to have in your back pocket if you want to say, well, look,

00:13:46.780 I want to explore the biology of what’s going on in this disease, but I don’t want to limit

00:13:50.580 myself to what people have curated into a database.

00:13:53.020 So this is kind of a data-driven way to figure out what those modules could be.

00:13:58.380 And you might also say, well, there’s probably just an enormous amount of technical artifacts.

00:14:01.700 You just told me you gathered a whole bunch of random human data from the internet.

00:14:05.360 One of the positive things that we saw that was kind of suggestive that this is not driven

00:14:09.140 exclusively by technical artifacts is actually this proportion bumps up a little bit when

00:14:13.900 you look at the recount2 data, as opposed to the SLE data.

00:14:17.140 If it was entirely driven by technical artifacts, you’d actually expect this to have fewer latent

00:14:21.460 variables that were associated only with a known process.

00:14:25.420 So this was also encouraging.

00:14:28.300 So this gives us the idea that we kind of learn more unknown unknowns.

00:14:31.980 And then there’s quite a bit more of a deep dive in the paper, like looking at sort of

00:14:36.380 one of the things that we see is instead of having a cell cycle latent variable, you end

00:14:39.540 up with different phases of the cell cycle partitioned into latent variables.

00:14:43.660 But the sort of takeaway was that if you do these machine learning analyses while reusing

00:14:49.640 data from other contexts, you can get this level of detail that you couldn’t get just

00:14:54.040 analyzing your data alone.

00:14:55.840 So starting with a whole bunch of other data, learning the pathways and processes there,

00:14:59.520 and then applying it to your data gives you this higher level of resolution.
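That transfer step, learning loadings once on a big generic compendium and reusing them, can be sketched as a ridge-style projection (a hypothetical illustration with random matrices, not the authors' exact code):

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_lvs = 100, 10

Z = rng.random((n_genes, n_lvs))            # loadings learned once on generic training data
Y_new = rng.standard_normal((n_genes, 25))  # your dataset (genes x samples), same gene ordering

lam = 1.0  # ridge term keeps the projection stable
B_new = np.linalg.solve(Z.T @ Z + lam * np.eye(n_lvs), Z.T @ Y_new)
# B_new is LVs x samples: each row is one reusable module's activity,
# directly comparable across any dataset projected through the same Z
```

Because every dataset gets expressed in the same fixed latent variables, standard statistics across tissues and cohorts become straightforward.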

00:15:02.800 There is a bit of an implicit assumption here, which is if the process that you were looking

00:15:07.040 at is truly unique and only occurs in your setting and no other settings, you can’t find

00:15:12.240 it because it’s not going to be present in the variability from other people’s data.

00:15:16.920 I think this is probably rare.

00:15:19.560 I don’t think it’s terribly common that there are processes that are so exclusive that they’re

00:15:23.400 only used in one and only one biological context and nowhere else.

00:15:27.120 But if you believe that to be the case, you should know you will not find it with this

00:15:29.800 method.

00:15:32.800 And so then just kind of recapitulating.

00:15:34.480 So in the past, when Jackie joined the group, she had this modular framework approach, which

00:15:39.680 is actually really nice and has now been used in scleroderma and other contexts

00:15:46.280 to connect pathways across tissues and studies.

00:15:50.240 But the multiplier approach she developed has some nice advantages.

00:15:53.480 So she takes this generic human data, recount2, she can train a model, then transfer

00:15:57.520 it to the datasets of interest, and then just look across those datasets with standard statistics.

00:16:02.320 So this is an example of one of the things we can do with this.

00:16:06.080 So this was actually the thing we wanted to do when we started the study.

00:16:10.040 So these are three different datasets from individuals with ANCA-associated vasculitis.

00:16:15.200 One of the challenges here is that all of these datasets are microarray-based.

00:16:19.840 All of our training data is RNA-seq.

00:16:22.560 A different student in the lab developed a technique to do this. If you’re interested in taking

00:16:27.240 sort of machine learning methods and applying them to gene expression data across these

00:16:33.920 contexts, for many methods, there are reasonable ways to do that transformation that are not

00:16:41.040 completely horrendous, which is the best advertisement I can give for a method.

00:16:47.280 But there’s quite a few different methods.

00:16:48.280 And actually, quantile normalization is not bad in this context.

00:16:51.760 The zeros kind of give you a bit of trouble, but it’s not horrendous.
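As a sketch of the quantile-normalization idea (toy data; each microarray sample is mapped onto an RNA-seq-like reference distribution by rank):

```python
import numpy as np

def quantile_normalize_to_reference(x, reference):
    """Map each sample's values onto the reference distribution by rank."""
    ref_sorted = np.sort(reference)
    out = np.empty_like(x, dtype=float)
    for j in range(x.shape[1]):              # one array sample at a time
        ranks = x[:, j].argsort().argsort()  # 0..n-1 rank of each gene within the sample
        out[:, j] = ref_sorted[ranks]        # replace each value with the reference quantile
    return out

rng = np.random.default_rng(2)
microarray = rng.normal(8, 2, size=(500, 4))  # log-intensity-like array values
rnaseq_ref = rng.gamma(2.0, 2.0, size=500)    # target distribution drawn from RNA-seq-like data

normalized = quantile_normalize_to_reference(microarray, rnaseq_ref)
# every column now has exactly the reference's set of values, in rank order
```

After this, a model trained on the RNA-seq distribution sees array samples that at least live on the right scale, which is why this works tolerably across platforms.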

00:16:55.400 And so this is what we’re doing here.

00:16:57.680 So we’re actually asking, can the MultiPLIER model apply to array datasets, even

00:17:03.200 though it’s trained exclusively in RNA-seq data?

00:17:05.000 We would have done this in RNA-seq data, but it turned out there wasn’t RNA-seq data for

00:17:08.920 this that was available yet.

00:17:10.840 So now we’ve gotten to the point where the datasets for this disease are large enough

00:17:13.040 that they actually do exist in RNA-seq as well.

00:17:16.440 So what’s going on here?

00:17:18.080 So we’ve got three different datasets: airway epithelial cells, renal glomeruli, and PBMCs.

00:17:23.160 They’re collected in three different studies, so there’s no matched people.

00:17:26.800 And also the conditions are brutally different.

00:17:28.960 So what’s going to happen is from the left side to the right side of each of these plots,

00:17:32.640 we’re going to go from the most severe form of the disease to

00:17:36.360 the least severe form of the disease to healthy controls.

00:17:39.360 So this dataset for the airway epithelial cells has, these are the vasculitis data,

00:17:45.320 but then it’s got things like rhinitis and healthy controls.

00:17:49.280 And so we’re basically saying, okay, what is associated with severity across these three

00:17:55.640 different cohorts?

00:17:58.160 And so one of the latent variables that comes up as severity-associated is this M0 macrophage

00:18:03.040 signature.

00:18:04.360 And you can see the same thing where in each group you look in, the least severe form of

00:18:09.640 the disease is on the right, the more severe form of the disease is on the left.

00:18:12.280 So you can see this latent variable is severity-associated. And due to the bizarre sort of path

00:18:17.440 of academic publishing, the timeline of what came next is a little odd.

00:18:20.040 So our guess was that M0 macrophages could be involved here.

00:18:24.280 Well, before this actually was published, but after the preprint came out, we had some

00:18:29.000 follow-up work.

00:18:31.640 And I have to break all the chronology of science.

00:18:34.280 Our follow-up work came out first demonstrating that it looked like there was a change in

00:18:38.400 macrophage metabolism in the disease that could be sort of influencing severity in a

00:18:43.240 systemic way.

00:18:44.880 You can also use this type of analysis to say what’s particular to a tissue, right?

00:18:48.400 So you could say what latent variables are associated with severity in this tissue, but

00:18:52.560 not other tissues.

00:18:53.720 So it gives you the ability to start doing those analyses in a way that it’s pretty darn

00:18:56.920 difficult to do with just the modular framework alone.

00:19:03.760 And then I have an almost five-year-old, she turns five in three weeks, and she’s been

00:19:08.560 watching Zootopia.

00:19:10.680 And there’s a line in a song in Zootopia where I was listening, I’m like, oh my gosh, this

00:19:15.040 is science.

00:19:18.760 So the line is, I’ll keep on making those new mistakes.

00:19:21.680 I’ll keep on making them every day, those new mistakes.

00:19:25.120 And so we’re really big on this in the lab, right?

00:19:27.840 But what I tell people is, it’s not going to work, just make it not work differently

00:19:33.160 each time.

00:19:34.160 If it’s not working the same way each time, that’s not good, but if it’s not working for

00:19:37.240 different reasons, that’s perfect.

00:19:40.080 And so we do this in our own work.

00:19:41.700 So this is the first part of the GitHub README that’s associated with this paper.

00:19:47.660 So if you want to know sort of, these are all notebooks, if you want to follow along

00:19:51.480 with the work that we did for this MultiPLIER paper.

00:19:54.360 The first part of this is kind of our proof of concept exploration to just understand

00:19:57.720 how the method worked.

00:19:58.720 Then you get to the stuff that’s in the paper, then you get to the stuff that’s in the supplement,

00:20:03.280 then you get to the stuff that’s neither in the paper nor the supplement, because it turned

00:20:06.080 out the paper was too long.

00:20:08.080 And so this gives you a way to see like, okay, here’s all the stuff we did.

00:20:12.480 So there was one experiment that we did where we wanted to say, can you predict outcome

00:20:15.640 in clinical trials from these latent variables?

00:20:18.220 And so we got this Rituximab data set from the NIH that was testing this.

00:20:23.000 It turned out that the data set structure was, let’s say, suboptimal, in that some of

00:20:27.800 it was paired-end and some of it was single-end sequencing.

00:20:31.320 And this was confounded with the endpoint.

00:20:33.880 So it turned out to be extremely difficult to analyze, and we couldn’t really learn anything

00:20:37.800 from it.

00:20:38.800 But if you’re interested in using that idea for your own work, probably not that

00:20:43.160 data set.

00:20:44.160 You know, we’ve got a notebook here that’s like, okay, here was our attempt to build

00:20:48.040 a model to predict response.

00:20:49.240 So you can start from that.

00:20:50.480 So if you’re interested in this, we try to do this for each of our papers.

00:20:55.840 So this is available.

00:20:56.840 The GitHub is here.

00:20:58.400 If you search for Taroni and MultiPLIER, you’ll probably find it.

00:21:03.080 But I thought this was a nice example of kind of how we’ve taken a project from inception

00:21:07.960 through execution through kind of deliverables.

00:21:11.000 This method, we’ve seen some other uses now.

00:21:13.640 So someone used the same thing to study neurofibromatosis.

00:21:17.600 That came out relatively recently.

00:21:19.120 I can’t remember, there’s a few other sort of rare disease analyses that people have

00:21:23.120 started using this for.

00:21:24.120 But we really like seeing that, right?

00:21:25.120 Because it demonstrates uptake, and rare disease transcriptomics is

00:21:31.120 a relatively small community.

00:21:33.800 So it’s nice to see this stuff beginning to catch on.

00:21:37.880 I would also say, you know, we started with about $70 million worth of data.

00:21:44.240 If you are Lego Grace Hopper and you happen to have an internet connection, you can have

00:21:48.280 about $4 billion worth of data at your fingertips.

00:21:51.600 So if you’re interested, there’s a few more resources that have come online.

00:21:55.800 I think ARCHS4 now has like 650,000 samples.

00:22:00.040 So that’s about, you know, if you want to estimate $650 million of preprocessed data

00:22:04.120 at your fingertips.

00:22:05.960 In a previous position, we built something called refine.bio that’s about a million

00:22:08.920 samples.

00:22:11.240 So these types of resources are available, which is great, because then you don’t have

00:22:14.120 to go back and rebuild this.

00:22:15.680 You don’t have to do all the software engineering to reprocess the data in a uniform way.

00:22:18.960 You just kind of start from the processed data.

00:22:21.560 And I think this opens up a lot of avenues of exploration.

00:22:26.600 I like to, you know, one of the things that I say about what our lab works on is machine

00:22:31.880 learning, public data, and the transcriptome.

00:22:33.680 Pick two of three, and we’re probably interested.

00:22:37.040 Vannevar Bush essentially wrote and designed the way that we fund science in this country.

00:22:42.840 So this idea that most science is going to happen outside of government research labs,

00:22:46.280 it’s mostly going to happen at universities, it’s mostly going to be grant funded.

00:22:49.800 He wrote this letter to FDR that says, “The pioneer spirit is still vigorous within this

00:22:54.080 nation. Science offers a largely unexplored hinterland for the pioneer who has the tools

00:22:58.080 for his task.”

00:22:59.080 Well, I would say with open data, the opportunities here are remarkable, like

00:23:05.840 the ability to take these data sets off the shelf and learn how something works

00:23:11.520 at a scale that’s very difficult to reach from the data generated in only one lab.

00:23:15.760 And once you do that, you can then test it.

00:23:17.360 And I really think there’s value in using other people’s data as sort of the starting point to generate

00:23:20.920 hypotheses that you then go test.

00:23:24.080 There’s an enormous amount of unexplored opportunity here.

00:23:28.080 We also think sometimes about other data types instead of just gene expression.

00:23:31.640 So this is work from David Nicholson, who was a PhD student in the lab who just graduated

00:23:35.840 last year, who was like, well, let’s do that.

00:23:38.120 I just want to understand what’s on bioRxiv anyway.

00:23:41.960 So at this point, I probably don’t have to introduce it, bioRxiv is a preprint server.

00:23:46.760 And so this gives us the ability to also study the peer review process in some ways.

00:23:50.560 So we can see what gets posted to bioRxiv, and then we can look at the sort of what the

00:23:54.040 final paper looks like.

00:23:56.160 We started this project just around the time that bioRxiv released an XML repository of

00:24:03.240 their complete collection of data.

00:24:04.920 So if you’re interested in not just having a complete collection of transcriptomic data,

00:24:08.120 you can also go get a complete collection of XML preprints, which I think is really

00:24:11.640 exciting and a lot of fun.

00:24:15.240 You learn some things if you start looking at just the metadata associated with this.

00:24:18.800 So this is one of the simple questions that David asked was just, well, if there are preprints

00:24:25.160 with multiple versions, are people sort of adjusting their preprint in response to peer

00:24:29.160 review?

00:24:30.280 So if someone submits their paper, they get comments back, do they generally repost it?

00:24:34.040 We can’t directly answer that question, we don’t have access to the journal system.

00:24:36.920 But we could say, well, if that were happening, probably what would happen is that

00:24:41.520 with each additional version, you’d see an extension in the time to publish.

00:24:45.120 Sure enough, you see that.

00:24:46.120 And actually, the coefficient on the X here is about 50, which is in days.

00:24:53.080 So it suggests that adding a version means sort of 50 days longer in the publication

00:24:57.000 process, which is kind of aligned with what you’d expect to see if people are putting

00:25:00.560 up papers and revising them in response to peer review.
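
The regression described here can be sketched in a few lines. This is a minimal, hypothetical version with synthetic data; the actual analysis used real bioRxiv preprint-paper pairs, and the ~50-days-per-version coefficient is the talk's reported figure, baked into the synthetic data purely for illustration.

```python
import numpy as np

# Synthetic sketch: regress time-to-publish (days) on preprint version count.
rng = np.random.default_rng(0)
versions = rng.integers(1, 6, size=500)              # number of posted versions
days = 120 + 50 * versions + rng.normal(0, 30, 500)  # ~50 extra days per version

# Ordinary least squares: days ~ intercept + slope * versions
X = np.column_stack([np.ones_like(versions, dtype=float), versions])
coef, *_ = np.linalg.lstsq(X, days, rcond=None)
print(f"slope ≈ {coef[1]:.1f} days per additional version")
```

With real data you would of course also inspect residuals and confidence intervals before reading the slope as "days added per revision."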

00:25:06.520 Another thing we can ask, and this is getting into the text itself, is whether

00:25:13.480 more changes in text between the preprint and the published version

00:25:18.320 come with a longer time to publish.

00:25:20.640 And the answer to that is kind of yes-ish.

00:25:23.400 If a published version changes more from the preprint,

00:25:28.560 it does take a bit longer to publish.

00:25:31.000 But it’s not an incredibly substantial change.

00:25:33.840 And actually, the other thing that we did, so as we were doing this project, there was

00:25:36.560 another group that did a completely different project, where they took a set of COVID papers

00:25:42.420 that were published first as preprints and followed them through.

00:25:45.480 Their scientific question was different.

00:25:46.800 They wanted to say, for COVID-related papers, does the message of the paper change as it

00:25:52.080 goes through the publication process?

00:25:53.080 And they found that only in one case did that happen out of the 300-odd papers that they

00:25:56.480 examined.

00:25:57.480 But what that gave us was an annotated list of COVID papers.

00:26:00.100 So we could then take that and ask if it had the same relationship, and it actually didn’t

00:26:03.280 have the same relationship.

00:26:04.800 So for the subset of the literature in early 2020, COVID papers were being published quickly

00:26:10.460 regardless of how much text changed between the preprint and published version.

00:26:14.400 So this was kind of an interesting way to explore how publishing was happening.

00:26:18.280 So for those of you who have had the opportunity to have papers go through peer review, can

00:26:24.080 you guess what the most common linguistic change is, if we just look at word-level linguistic

00:26:28.800 change during the publishing process?

00:26:39.200 Has anyone ever had to add supplementary or additional data?

00:26:43.000 It’s not the most common.

00:26:44.480 The most common is actually, oh, no, it is the most common, yeah.

00:26:48.280 So additional and file.

00:26:49.280 So on the right here is what’s enriched in the published literature, and the left is

00:26:53.320 what’s enriched in preprints.

00:26:55.280 So file and additional and supplementary are all pretty high at the top.

00:26:58.800 So when people are changing their papers, we can infer that

00:27:04.280 they’re often adding stuff to the supplement, but maybe they’re not adding that

00:27:08.040 much to the main paper.

00:27:09.640 The other stuff that’s in there is kind of interesting, like fig and figure.

00:27:12.560 And because journals have different styles, you also see the plus-minus symbol and the em dash.

00:27:16.640 So you can see the artifacts of typesetting, but this gives a way to kind of understand

00:27:22.960 what’s on each side.

00:27:23.960 And I should say, this analysis uses only preprint-published

00:27:28.360 pairs.

00:27:29.360 If you do the same thing with all of BioRxiv and all of PubMed, you essentially find field

00:27:32.760 differences.

00:27:33.760 So some fields use BioRxiv more than others.

00:27:35.400 So this is more carefully controlled than that.
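
The word-level comparison described above can be sketched as a smoothed log-odds ranking between the two corpora. This is a toy with made-up sentences, not the actual pipeline; the real inputs were full preprint-paper pairs.

```python
from collections import Counter
import math

# Toy corpora standing in for preprint vs. published text
preprint = "we analyzed data see fig one results in fig two".split()
published = "we analyzed data see additional file one results in supplementary file two".split()

pre, pub = Counter(preprint), Counter(published)
vocab = set(pre) | set(pub)

def log_odds(tok):
    # add-one smoothing; positive = enriched in the published version
    p_pub = (pub[tok] + 1) / (sum(pub.values()) + len(vocab))
    p_pre = (pre[tok] + 1) / (sum(pre.values()) + len(vocab))
    return math.log(p_pub / p_pre)

ranked = sorted(vocab, key=log_odds, reverse=True)
print(ranked[:3])   # tokens enriched in published text
print(ranked[-3:])  # tokens enriched in preprints
```

On the toy data, "file", "additional", and "supplementary" surface on the published side and "fig" on the preprint side, mirroring the pattern from the talk.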

00:27:39.040 The other thing that we’ve done, just if you happen to have a preprint yourself, we have

00:27:42.960 this web server that does a linguistic comparison between a selected preprint and all of PubMed

00:27:51.720 and says, okay, well, here’s journals that publish linguistically similar papers.

00:27:57.320 Here’s papers that are linguistically similar to yours, and this has a secret feature.

00:28:01.640 We designed it to try to get people to upload

00:28:04.400 their preprint, but then people said, well, I have a preprint, but it’s on arXiv

00:28:06.920 and you don’t support arXiv.

00:28:08.880 So the secret feature is that you can also drag a PDF over the search box,

00:28:11.720 but we don’t generally advertise that, because the goal was to get people to post

00:28:18.880 preprints so they could use the service, and we don’t support arXiv directly.

00:28:21.560 So if you have an arXiv preprint, we’ll allow you to drag it over the search

00:28:25.760 box, but no other PDFs.

00:28:30.520 So this came out last year, and there’s, again, a GitHub associated with it if you want to

00:28:34.800 see all the kind of exploration that we did on the way.

00:28:38.480 David had a follow-up paper that I’m really excited about, where he looked

00:28:44.480 at what words change their meaning over the last 20 years of scientific

00:28:51.200 publishing.

00:28:52.200 So there’s an application associated with that as well that I should have put the link

00:28:55.520 on here, but didn’t, that we call Word Lapse.

00:28:57.800 So if you go to our lab and look for Word Lapse, you’ll find it, and it’s really interesting.

00:29:02.640 So we see things like hallmarks of new technologies, like, you know, CRISPR has a linguistic shift.

00:29:07.520 We also see a lot of pandemic-associated words have linguistic shifts.

00:29:13.080 So if you’re interested in understanding how our language changes, that’s also something

00:29:16.760 that David did.

00:29:19.040 Okay, and then I know this is a less medically-related audience than most of the places that I speak,

00:29:28.360 but one of the things I wanted to share was how the

00:29:34.000 techniques we develop in this basic science can contribute

00:29:37.120 to changes in how healthcare gets delivered.

00:29:40.040 And so this is also something that we think about, right?

00:29:43.480 Remember our business is serendipity.

00:29:44.960 Yes, sometimes that’s in research, right?

00:29:47.080 Whether that’s sort of me telling you how papers change, so that you can think about

00:29:50.560 how you would change your paper in response to peer review, just add more additional files.

00:29:55.160 But sometimes that’s, you know, in care, in clinical care, right?

00:29:59.200 So that someone, you know, you can imagine a patient comes in, there might be reasons

00:30:02.720 that that patient might need to receive a different treatment.

00:30:04.720 Could we provide that kind of information at the point of care?

00:30:07.920 So this is a big focus at our med school, and the health system that’s associated with our

00:30:15.000 med school, University of Colorado Health, has an entire program in clinical intelligence.

00:30:21.440 This is the idea I like to highlight: serendipity is having the right information

00:30:27.000 at the right time to make the right decision.

00:30:30.440 In most health systems, it’s rare for someone to get something called pharmacogenomic

00:30:35.600 testing.

00:30:36.600 So the idea here is people have different variants in their genome.

00:30:39.580 Some of those variants can affect how you metabolize drugs, how you respond to different

00:30:42.640 drugs.

00:30:44.280 It’s not terribly common that people get tested for pharmacogenomic variants. But say you

00:30:49.600 are going to a hospital and, you know, you need to have a stent inserted;

00:30:54.800 one of the common treatments is Plavix. Well, there’s an interaction between

00:30:59.080 Plavix and a set of genetic variants where you metabolize the

00:31:02.760 drug differently, and it doesn’t work for you, which means you’re not getting the benefit

00:31:05.680 of reducing your heart attack risk, or your clot risk.

00:31:09.800 But most people don’t get this testing, because if a physician orders the testing,

00:31:14.560 they get a 70-page PDF back. Then they have to take that 70-page PDF, go to a table

00:31:18.840 like this, read everything related to the drug they’re about to prescribe

00:31:22.560 from the table, and then understand whether it applies, right?

00:31:25.360 That is not a common thing for a physician to do.

00:31:28.160 Providers don’t get reimbursed for that type of work.

00:31:31.480 What’s happening at the University of Colorado and UC Health is we’ve got clinical decision

00:31:35.320 support built into the electronic health record around this stuff.

00:31:38.280 So this is the same information, except instead of a 70-page PDF plus this

00:31:42.640 table, if a provider were to go in and order Plavix for an individual who’s not going to

00:31:48.640 benefit from it, it pops up an alert that says: look, we recommend you remove this because

00:31:52.160 it’s not going to work, read about why it’s not going to work if you want to, but

00:31:56.000 we recommend you apply one of these alternatives that will work for this patient.
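
At its core, the decision-support logic described here is a lookup from an ordered drug and a patient's pharmacogenomic phenotype to a recommendation, checked at order entry. This is a hypothetical sketch; the table entry is illustrative only and not clinical guidance (clopidogrel is the generic name for Plavix, and CYP2C19 is the gene involved in that interaction).

```python
from typing import Optional

# Illustrative (drug, phenotype) -> recommendation table; not clinical guidance
ALERTS = {
    ("clopidogrel", "CYP2C19 poor metabolizer"):
        "Reduced effectiveness expected; consider an alternative antiplatelet.",
}

def check_order(drug: str, phenotypes: list[str]) -> Optional[str]:
    """Return an alert message if the ordered drug interacts with any of
    the patient's phenotypes; None means no alert fires."""
    for phenotype in phenotypes:
        message = ALERTS.get((drug, phenotype))
        if message is not None:
            return message
    return None

# Fires for the affected genotype, stays quiet otherwise
print(check_order("clopidogrel", ["CYP2C19 poor metabolizer"]))
print(check_order("clopidogrel", ["CYP2C19 normal metabolizer"]))
```

The real system's hard part is everything around this lookup: getting structured results into the EHR, surfacing the alert at the right moment, and suggesting alternatives.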

00:31:59.200 And so this is serendipity, but not just in research, in clinical care.

00:32:04.520 And this is, if you’re interested in this kind of story, this is another one.

00:32:08.400 This was an individual, different condition, where there was a question about drug efficacy.

00:32:12.920 This is a story from UC Health.

00:32:16.120 And in this case, the provider keyed in an order, and an alert popped up that says, oh,

00:32:20.480 this person’s going to need a different dose.

00:32:23.280 And that was helpful to the provider to make that decision.

00:32:26.640 One of the things I’ve had the privilege of doing over the last couple of years is focusing

00:32:31.400 on this program.

00:32:32.400 So a lot of faculty in our department work in this program, and about a year and a half

00:32:37.000 ago, I guess two years ago, the previous director left, and so I ended up as the interim director.

00:32:42.480 So I’ve gotten to know this program pretty well.

00:32:44.920 So this is our Colorado Center for Personalized Medicine.

00:32:48.080 We have a biobank study that this is all tied to.

00:32:50.800 So if someone comes in, they can consent to have their sample collected for the biobank.

00:32:55.080 We have a robust return of results pipeline built on that.

00:32:57.920 So our biobank is growing pretty rapidly.

00:33:00.600 Our rate of sample collection is picking up a lot.

00:33:02.720 But then the other thing we ask is, how is this making a difference in care?

00:33:05.800 So essentially, how many of these alerts are actually firing?

00:33:08.560 So over the last year, we’ve had about 1,000 patients who’ve had an alert fire at some

00:33:14.320 point in clinical care.

00:33:16.360 That’s a tenfold increase over our entire previous history.

00:33:20.200 And that’s been powered because we’ve recently focused on getting these results back into

00:33:26.280 the EHR in a structured way.

00:33:27.560 So we’ve seen almost 100-fold growth, actually more than 100-fold growth year over year.

00:33:31.800 We’ll probably have 210,000 results in the electronic health record at sort of the two-year

00:33:35.800 mark.

00:33:37.080 And what this means is that if you’re interested in studying this type of process in terms

00:33:42.080 of care delivery, if you’re interested in studying how physicians respond, if you’re

00:33:45.600 interested in looking for new cases where there are these sort of drug-gene interactions,

00:33:50.560 we have the ingredients to do that at Colorado in a way that no one else that I’m aware of

00:33:54.240 does.

00:33:55.720 And so this program continues to grow.

00:33:58.200 I’ll just give you one.

00:33:59.240 This is actually a real story that happened over the last few months.

00:34:03.320 So this is a stock photo.

00:34:04.320 I cannot show you a picture of the patient, but a patient came in to a community oncology

00:34:09.280 clinic.

00:34:10.480 And this works across the entire UC health system.

00:34:12.160 So it’s not just an academic hospital.

00:34:13.680 This is a major health system that serves the Mountain West.

00:34:17.360 So this patient came into a community oncology clinic.

00:34:20.280 They were prescribed a drug that, based on their genetic variants, would have created

00:34:27.480 a significant risk of life-threatening complications.

00:34:33.520 Our team noticed this, sent a message to the provider, and then the patient alert actually

00:34:41.400 fired and recommended a reduced dosage of the drug.

00:34:45.000 The oncologist actually did proactively reduce the dose.

00:34:48.000 So the person started at a different dose than would traditionally be used.

00:34:53.680 Even at that dose, they didn’t tolerate it very well.

00:34:55.600 So they had to further reduce the dose.

00:34:57.840 In these types of cases, you can imagine what happens if you start at the

00:35:01.200 traditional dose: for individuals with this particular

00:35:05.920 variant, these drugs can be lethal.

00:35:07.720 And so this is a case where, you know, yes, I told you there’s 1,000 alerts, but each

00:35:11.680 of those 1,000 alerts is some story like this, right?

00:35:14.360 And so it’s nice to see this actually being used to deliver care at scale.

00:35:19.640 And so we’re doing this, this is all informatics, right?

00:35:23.040 You can get all of this serendipity without machine learning; none of this has machine learning

00:35:27.160 built into it yet, but it’s going to.

00:35:30.480 And as we think about that, I think it’s really important not just to sort of think

00:35:33.920 from the machine learning point of view, but to really think about practical clinical care

00:35:37.360 pathways.

00:35:38.360 So this is a piece from Siddhartha Mukherjee; if you’re interested in AI and

00:35:44.160 medicine, I realize this is dated now, but it’s still worth reading.

00:35:49.040 It’s also weird that five years counts as old, but it’s still worth reading.

00:35:54.320 It has a quote from Geoffrey Hinton that says they should stop training radiologists

00:35:57.800 right now.

00:36:00.440 And why would someone say this, right?

00:36:03.280 Well, Geoff Hinton’s looking at the literature, right?

00:36:05.480 So here I’ve just collected some literature from around the same time.

00:36:08.760 So this is sort of saying, look, deep learning is going to completely transform healthcare.

00:36:11.680 It’s going to change care as we know it.

00:36:14.840 Another similar example, and more examples: everything you read in the literature was deep

00:36:20.000 learning.

00:36:21.000 Now we’re all into large language models, but at the time these image models

00:36:23.760 were going to completely transform healthcare.

00:36:25.600 Well, you might ask, is there anything they’re not good at?

00:36:29.080 Like they’re good at everything, right?

00:36:31.440 Chihuahuas and blueberry muffins, not terribly good here.

00:36:37.000 This one’s kind of wild.

00:36:42.520 So I perceive the thing on the left as a panda.

00:36:44.840 It looks like a picture of a panda.

00:36:48.160 I don’t really perceive this as much of anything.

00:36:51.640 We’re going to add, you know, this plus seven thousandths of this.

00:36:56.520 What do you think the output is going to look like?

00:37:02.120 A sloth.

00:37:03.120 A sloth?

00:37:04.120 Any Gibbons?

00:37:05.120 Anyone for Gibbon?

00:37:06.120 A bear.

00:37:07.120 A bear?

00:37:08.120 Okay.

00:37:09.120 So this is what it looks like.

00:37:10.120 It’s a Gibbon.

00:37:11.120 Clearly a Gibbon.

00:37:12.120 No question.

00:37:13.120 That’s a Gibbon.

00:37:14.120 A monkey.

00:37:15.120 Okay.

00:37:16.120 So it doesn’t look like a Gibbon to me, but our neural network is exceedingly convinced.

00:37:26.080 And the reason this works, right, is because the decision boundaries in these neural networks

00:37:29.200 are sort of nonlinear, and you can end up pretty close to a decision boundary without

00:37:32.360 really knowing it.
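
The panda-to-gibbon trick is the fast gradient sign method (FGSM): add epsilon times the sign of the input gradient. Here is a toy sketch where the "network" is just a linear scorer (so the input gradient is simply the weight vector), purely to show how a perturbation of at most 0.007 per element can flip a confident prediction.

```python
import numpy as np

# Toy linear "classifier": score = w . x, class = sign of score
rng = np.random.default_rng(1)
w = rng.normal(size=1000)                  # stand-in for network weights
x = rng.normal(size=1000)
x -= ((x @ w - 1.0) / (w @ w)) * w         # arrange a confident positive: score = 1.0

eps = 0.007                                # the "seven thousandths" from the slide
x_adv = x - eps * np.sign(w)               # FGSM step against the gradient sign

print(x @ w > 0, x_adv @ w > 0)            # True False: the class flips
print(np.abs(x_adv - x).max())             # perturbation is at most eps per element
```

The flip works because the tiny per-element change adds up coherently along the gradient direction, which is exactly the near-decision-boundary story above; real attacks do the same thing with the gradient of a deep network's loss.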

00:37:34.200 This is another example, which I think is just a lot of fun, so I have to throw it in.

00:37:38.160 So this is someone who was like, well, can I just make an adversarial sticker?

00:37:43.040 Like, can I have a sticker that I can put on something and have a deep neural network

00:37:46.960 perceive it as something, even if it’s not?

00:37:49.360 So this is the toaster sticker.

00:37:52.600 And so you can put the toaster sticker on a table next to what I perceive

00:37:56.960 to be a banana.

00:38:01.480 I perceive this to be a banana on a table.

00:38:05.320 Neural network classifier, banana on a table.

00:38:07.240 It’s good.

00:38:08.240 Stick the toaster sticker next to it?

00:38:09.960 Absolutely a toaster.

00:38:10.960 You can imagine doing the same thing with stop signs, and doing the same

00:38:15.520 thing with other fun technologies in the world of self-driving cars or deep neural

00:38:21.120 network vision.

00:38:22.120 Okay, so let’s go back to the automated radiologist finding pneumonia.

00:38:26.200 This was one of the examples I showed you before.

00:38:29.200 John Zech is a guy who writes blog posts that are really good and then converts

00:38:34.720 them into papers.

00:38:36.600 So this was a blog post from John Zech, which then became a paper that sort of said, okay,

00:38:41.480 why is the system actually working?

00:38:43.120 Like what’s happening here?

00:38:45.200 So he went to the images and tried to understand, you know, what part of the

00:38:51.080 image is contributing to the prediction of pneumonia?

00:38:53.920 Well, in this case it probably should find pneumonia.

00:38:58.480 There’s high density in the lung here.

00:39:01.760 You can say, well, okay.

00:39:02.760 Oh, and I should say a positive number is suggestive of pneumonia, a negative number

00:39:09.240 is not.

00:39:10.240 Okay.

00:39:11.240 So what are the positive numbers?

00:39:15.360 The most positive numbers are at the bottom, yeah, where there’s this interesting

00:39:22.720 stripe, a characteristic unique to the scanner, one that you might see

00:39:26.800 in a department if the scanner was placed proximal to where pneumonia

00:39:31.440 diagnoses were usually occurring.

00:39:33.840 Yeah.

00:39:34.840 So that’s interesting.

00:39:38.240 Here’s another one.

00:39:40.800 I should say these probabilities are low, but the previous one was in the 99th percentile

00:39:44.320 for pneumonia.

00:39:45.320 This one’s in the 95th percentile.

00:39:47.320 Yeah.

00:39:48.320 Portable?

00:39:49.320 Portable.

00:39:50.320 Someone’s not well enough to go to the scanner.

00:39:52.600 Not a good sign for their health, could indicate pneumonia.

00:39:55.860 Probably not the part of the image you want to be looking at.
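
The scanner-stripe and "portable" confounders can be reproduced in miniature: if an artifact tracks the label more cleanly than the real signal, a model learns the artifact. This is a synthetic sketch, not Zech's actual pipeline; "pixel" 0 stands in for the stripe and "pixel" 1 for genuine lung density.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 2000, 64
X = rng.normal(size=(n, d))
y = rng.integers(0, 2, size=n).astype(float)
X[:, 0] = (2 * y - 1) + 0.1 * rng.normal(size=n)  # artifact: nearly noiseless proxy
X[:, 1] = (2 * y - 1) + 2.0 * rng.normal(size=n)  # real signal: noisy

# Plain logistic regression fit by gradient descent
w = np.zeros(d)
for _ in range(300):
    p = 1 / (1 + np.exp(-X @ w))
    w -= 0.5 * X.T @ (p - y) / n

# A crude weight-based "saliency": the model leans on the artifact feature
print(int(np.argmax(np.abs(w))))
```

The model achieves high accuracy for the wrong reason, which is exactly why saliency-style inspection of *where* the model looks, as in the chest X-ray example, matters before deployment.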

00:39:59.320 And so, you know, every once in a while I

00:40:05.200 read the comment section of blog posts on the internet, which I do not recommend, but

00:40:08.960 there was one on big data.

00:40:09.960 There was this blog post on big data, something like why statistics

00:40:14.320 don’t matter in the era of big data.

00:40:16.800 And I’m like, okay, I have some disagreements with this.

00:40:21.040 Then I go to the comment section and I find this one, which I really agree with.

00:40:25.720 On big data, data collection biases are always larger than statistical uncertainty.

00:40:29.680 And I think this is why you can have these models that perform robustly

00:40:33.560 across a lot of these comparisons and still struggle.

00:40:36.200 And then I read, who made the post, and it’s this guy named Daniel Himmelstein, which probably

00:40:39.240 doesn’t mean anything to you, but he happened to be a postdoc in my lab.

00:40:42.160 And I’m like, dude, don’t bury this in the comment section of the internet, people should actually

00:40:46.440 read it.

00:40:47.440 So now I put it on slides, so at least someone can see it.

00:40:51.120 Okay, so how do we design systems that work?

00:40:54.600 So you know, for AI and medicine, regardless of what we’re using, I think we need to have

00:40:59.080 some principles that we think about.

00:41:01.520 We just did something that sort of tries to go on this path.

00:41:05.400 So this is a little bit different.

00:41:06.960 So this is a piece from Nature, but it talks about a preprint that we put up earlier this

00:41:11.800 year, where we were trying to use GPT-based models, so in this case, GPT-3, to revise

00:41:17.600 academic manuscripts.

00:41:19.280 One of the things that we see all the time is, well, now it’s really common, right?

00:41:23.440 People are developing services that can use GPT-3 or ChatGPT, or GPT-4 through ChatGPT

00:41:30.240 to revise your manuscript.

00:41:31.240 Well, what are the issues with those?

00:41:33.560 So our experience putting this together and running a bunch of test manuscripts through

00:41:36.360 it is that, yes, it is good at clarifying language.

00:41:40.280 There’s a lot of things it can help you with.

00:41:42.800 It actually caught an error in an equation, which I was pretty darn impressed with.

00:41:47.480 On the other hand, it also makes stuff up.

00:41:50.640 And so a bit of a challenge if your idea is that you’re just going to use this.

00:41:54.480 And I think, you know, I use this as an example because it’s really trivial.

00:41:59.240 You can try it out yourself and see how it works, and you can get these examples yourself.

00:42:03.360 The same things are going to be true when we use this in medical context, so we need

00:42:06.840 to think about them.

00:42:07.840 I think thinking about them and trying them and experimenting in a low-risk environment

00:42:11.440 before we move to a high-risk environment is usually a good idea.

00:42:14.440 So what are the principles that we’ve kind of come up with as we think about this?

00:42:17.440 So first, we really do aim for kind of an augmentation, not replacement.

00:42:22.440 And what that means is, you know, when we apply this to manuscripts to

00:42:27.760 improve them, well, we actually applied it to the manuscript about the tool.

00:42:32.520 And when we did that, it made up this thing that we had done: that we had fine-tuned the

00:42:36.200 model on manuscripts of a similar type.

00:42:38.240 Yeah, you should absolutely fine-tune the model on manuscripts of a similar type.

00:42:41.920 Makes a ton of sense.

00:42:43.200 We didn’t do it.

00:42:44.200 You probably shouldn’t report that you did it in the manuscript.

00:42:48.360 So we’re really thinking, okay, this is not something where you’re going to plug it in

00:42:52.280 and be done.

00:42:53.280 You really need to design it around an augmentation capability.

00:42:56.760 You’ve got to carefully consider your use cases.

00:43:00.000 You know, if it’s easy to take the output but hard to compare it against the input,

00:43:03.280 it’s probably not a good use case, right?

00:43:04.800 Because you’re creating the opportunity for a mistake that you don’t need to create.

00:43:09.240 We really like to start with these kind of simple solutions and approaches and layer

00:43:11.960 complexity only as needed.

00:43:13.320 So in this case, you know, we start with some pretty simple prompts and then have the ability

00:43:20.080 to add complexity.

00:43:21.280 But usually we just kind of try to keep it relatively basic and simple.

00:43:25.200 The workflow is simple.

00:43:26.640 You can proof of concept it out really quickly.

00:43:28.360 And the most important thing is preserving attribution.

00:43:30.440 Like where did the content come from?

00:43:31.680 If you’re thinking about this in a clinical setting, you know, what was provided and when?

00:43:35.760 And you know, did it come from an AI or a human first?

00:43:38.000 Because that’s really going to matter as you’re thinking about evaluating these workflows.
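
The "simple prompts, preserved attribution" principles can be sketched as a paragraph-by-paragraph revision loop that keeps a provenance record. This is a hypothetical structure, not the actual tool's code; the `revise` function is a stand-in for a language-model call.

```python
from dataclasses import dataclass

@dataclass
class Revision:
    original: str   # human-written paragraph
    revised: str    # model suggestion (must be reviewed; may hallucinate)
    source: str     # provenance label for later auditing

def build_prompt(section: str, paragraph: str) -> str:
    # One simple prompt per paragraph; layer complexity only as needed
    return (f"Revise the following {section} paragraph so the text is "
            f"clear and concise. Do not add new claims.\n\n{paragraph}")

def revise(paragraph: str) -> str:
    # Placeholder for the model call; just echoes the input in this sketch
    return paragraph.strip()

def revise_manuscript(sections: dict[str, list[str]]) -> list[Revision]:
    log = []
    for name, paragraphs in sections.items():
        for p in paragraphs:
            _ = build_prompt(name, p)  # would be sent to the model in practice
            log.append(Revision(p, revise(p), source="model-suggested"))
    return log

log = revise_manuscript({"abstract": ["We present a tool. "]})
print(log[0].source)  # every suggested span carries its provenance
```

Keeping `original` and `revised` side by side is what makes human review and journal-mandated AI disclosure tractable later.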

00:43:41.520 In academic writing, you’re going to want to keep track of whether

00:43:45.200 something came from an AI-based system or whether you wrote it.

00:43:48.960 More and more journals are starting to require this.

00:43:51.600 It’s going to be important.

00:43:53.520 But I just think like these are key principles that I would recommend keeping in mind.

00:43:59.160 And then finally, I just want to give you an idea of what the environment is like

00:44:02.680 at Colorado.

00:44:02.680 Because some of you may one day look for future jobs and I figure you should know something

00:44:05.840 about us.

00:44:06.840 We’re not at Boulder.

00:44:07.840 We’re at the Anschutz Medical Campus, which is kind of between the Denver

00:44:13.040 airport and Denver itself.

00:44:16.360 We are a major academic medical center.

00:44:19.240 And like I said, we’re not at Boulder, which is the thing that people most frequently get

00:44:24.480 confused about.

00:44:25.780 On July 1st of last year, we actually launched a new department of biomedical informatics.

00:44:30.980 And you know, we’re trying to hire and put together a faculty that are focused on this

00:44:34.020 idea of kind of making serendipity routine, like how do you surface the right information

00:44:37.940 at the right time.

00:44:40.220 We’re now at 31 faculty, and we have a new person starting in May that will get

00:44:45.260 us to 32 faculty. We have about $65 million in extramural research

00:44:50.340 on which faculty in the department are PIs; there’s a lot of additional collaborative

00:44:55.340 funding that’s not included in this.

00:44:57.620 We have expertise across the spectrum, from precision medicine through physiological

00:45:03.380 modeling. We have folks who think about human-computer interaction, because if you want

00:45:06.060 to deploy this stuff in the clinic, you should really think about how humans are going to

00:45:08.740 interact with it, through electrical engineering, medical imaging, and AI.

00:45:13.580 So there’s a lot of faculty working in this area.

00:45:16.940 This is just some stuff I collected from early last year.

00:45:21.100 It shows where our faculty were mentioned in the press, so there’s

00:45:24.660 stuff in MIT Technology Review, Nature, Le Monde, and, I can’t remember, whatever the

00:45:32.540 German public radio station is.

00:45:35.500 So you know, you have a bunch of internationally renowned experts who are at Anschutz; we just don’t

00:45:38.700 happen to be at Boulder.

00:45:39.700 You know, to me, it’s like the difference between Georgia Tech and Georgia, I feel like

00:45:44.740 they’re different institutions, we should probably occasionally recognize that.

00:45:49.660 The other thing that we focused on when we were creating the department: if you

00:45:53.460 read the sociology literature from the 80s, which I suspect all of you do on a regular

00:45:57.060 basis, there’s this article that I actually think you should read

00:46:02.020 called “The Mundanity of Excellence.”

00:46:04.740 So this is someone who essentially studied swimmers at many different levels and asked

00:46:08.260 what differentiates swimmers at one level from another level.

00:46:12.700 There’s a few different principles that come out, but one of the key ones is that excellence

00:46:16.260 requires qualitative differentiation.

00:46:18.620 And what I mean by that is, you know, you’re not going to move up a level in swimming competitions

00:46:22.180 by swimming two extra laps.

00:46:24.020 That’s not how it works.

00:46:25.020 What you’re going to do is you’re going to focus on your form, you’re going to, you know,

00:46:28.020 you’re going to approach the sport differently, you’re going to focus on getting rest the

00:46:30.920 night before the meet, like that’s the stuff that people do at higher levels that they

00:46:34.340 don’t do at lower levels.

00:46:36.660 And as we think about a department, we had to ask, okay, what’s our

00:46:39.380 qualitative differentiator?

00:46:40.380 How are we not just like another biomedical informatics department that just happens to

00:46:44.220 have more money or something, right?

00:46:45.580 Like that’s not, that’s not a real differentiator.

00:46:48.460 And so what we thought about was creating promotion and tenure guidelines

00:46:53.580 that are focused on real-world impact.

00:46:55.720 So in our departmental idea of impact, there is a bullet

00:47:00.700 point that includes publication.

00:47:02.220 It’s possible, you can do it, we do care about it.

00:47:05.540 But there’s also technology development that gets deployed locally, nationally or internationally.

00:47:10.780 Software that ships, changes in policy because of your work.

00:47:14.100 All of that is included and counts as impact.

00:47:17.100 Now, probably not all of you will look for tenure track faculty positions in our department.

00:47:21.340 But if you were to come train or send folks to train with us, I think it’s important to

00:47:26.660 know like that filters down, right?

00:47:28.340 If that’s how faculty are evaluated, that filters all the way down.

00:47:30.940 So there’s an emphasis on real world impact that we have that I think can be our kind

00:47:34.340 of qualitative differentiator.

00:47:35.340 And I’ll just say, we have a really good training environment.

00:47:39.620 So we have strong connections with UC Health.

00:47:41.100 So I told you earlier about one of the UC Health programs that we work closely with

00:47:44.660 them on.

00:47:45.660 And Children’s Colorado, which is a nationally renowned pediatric

00:47:51.220 hospital.

00:47:52.220 I know the pediatric cancer people who work in that space.

00:47:55.940 So we have those tight connections if you’re interested in seeing your work translate to care.

00:47:56.940 And if you’re interested in

00:48:01.140 using genetics to guide care, I think we have one of the best programs in the country

00:48:03.620 through CCPM.

00:48:05.340 We have a diverse and internationally recognized faculty.

00:48:08.740 One of the things that sometimes surprises folks: our tenure

00:48:12.680 track faculty in DBMI are actually majority women, which I think is uncommon in

00:48:19.620 our field.

00:48:21.380 And then there’s the climate: it’s a little bit humid here.

00:48:24.900 We don’t have that level of humidity, but we do have more hours of sun per year than

00:48:28.340 Miami and San Diego.

00:48:29.340 So if you’re interested in the environment around you, we’ve got that.

00:48:31.860 And then we’ve got abundant outdoor activities.

00:48:34.780 This is one example of the programs that we have.

00:48:36.580 So this is our computational science PhD program.

00:48:39.020 There’s also a postdoc training grant associated with the same thing.

00:48:44.740 So if you’re interested in this type of thing, feel free to look us up.

00:48:47.780 You can always drop me an email and I can try to connect you with folks too.

00:48:51.580 And then with that, I just want to thank the people who make this possible.

00:48:55.860 So the members of the lab; we really have a robust culture in the lab of sharing

00:49:00.140 the work that’s happening and thinking through each other’s projects

00:49:04.700 in ways that are really helpful.

00:49:06.420 We also do code review.

00:49:07.420 So people really pitch in together. Also the Department of Biomedical Informatics and

00:49:12.060 my leadership team, the folks in CCPM, since I shared some of that work, and

00:49:15.660 then the folks who fund us. With that, I’d be happy to take whatever questions you have.

00:49:25.860 For the radiology XAI explainability, did they end up using Grad-CAM, like how

00:49:35.340 they interrogated the convolutional layers of the network?

00:49:41.380 Yeah.

00:49:42.380 I don’t remember what strategy they used.

00:49:45.580 And I also don’t remember, you know, exactly what kind of map it was.

00:49:51.220 I think it’s a saliency map, but I’m not 100% sure.

00:49:55.580 Yeah, so John posted that first as a blog post, and there’s now a PLOS Medicine

00:50:00.300 paper, I think.

00:50:04.300 And I wrote this slide when it was the blog post and not the PLOS Medicine paper, which

00:50:07.020 is what I’d look at now.

00:50:09.020 Yeah.

00:50:11.020 So with the strange explainability maps, did that happen even with augmentations

00:50:19.540 that would rotate or zoom in and out?

00:50:22.100 Did it still happen when trained with those, or was it trained without them?

00:50:26.300 Because I was thinking a zoom augmentation might solve that bottom band

00:50:32.540 problem.

00:50:34.540 So I was wondering if that still happens.

00:50:36.780 I’m guessing here; so this is Andrew Ng’s group’s work.

00:50:39.180 This was the one that was being criticized.

00:50:40.980 I think this was just chest x-rays, and I don’t remember them doing at

00:50:48.820 least a patch-based augmentation or anything like that.

00:50:52.020 Yeah.

00:50:53.020 And I would say now some of the techniques that are more sophisticated are likely to

00:50:57.100 control for some of that.

00:50:58.100 I mean, the other thing you could do is just do some adversarial training

00:51:01.540 around the location of the scanner.

00:51:04.500 The challenge with that is you need to know to do it, and to know to do it, you have to

00:51:07.940 have someone who’s an expert probe your data.

00:51:10.220 And I think sometimes when we come to things from a computer science perspective,

00:51:16.300 we get really excited that something is working, and especially if it’s

00:51:22.340 working as well as a human, maybe we get a little bit ahead of ourselves and aren’t

00:51:31.780 skeptical enough about our own results.
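
The scanner-marker failure mode discussed above can be made concrete with a small occlusion-sensitivity sketch (a simple stand-in for the saliency-map approach mentioned; the model, image size, and marker location here are all invented for illustration, not taken from the paper):

```python
import numpy as np

# Toy illustration (invented, not the paper's model): a classifier that
# secretly keys on a portable-scanner marker in the bottom-left corner
# of an 8x8 "chest x-ray" instead of the lung region.
def model_score(img):
    lung_signal = img[2:6, 2:6].mean()   # the signal we *want* it to use
    marker = img[6:, :2].mean()          # the confound it actually uses
    return 0.2 * lung_signal + 0.8 * marker

# Occlusion sensitivity map: zero out each 2x2 patch and record how much
# the score drops. Large drops mark regions the model relies on.
def occlusion_map(img, patch=2):
    base = model_score(img)
    h, w = img.shape
    sal = np.zeros((h // patch, w // patch))
    for i in range(0, h, patch):
        for j in range(0, w, patch):
            occluded = img.copy()
            occluded[i:i+patch, j:j+patch] = 0.0
            sal[i // patch, j // patch] = base - model_score(occluded)
    return sal

img = np.full((8, 8), 0.5)
img[6:, :2] = 1.0                        # scanner marker is present
sal = occlusion_map(img)
hot = tuple(int(v) for v in np.unravel_index(sal.argmax(), sal.shape))
print("hottest patch:", hot)             # -> hottest patch: (3, 0), the marker
```

Probing like this is exactly how the corner marker, rather than the lungs, shows up as the region driving the prediction.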

00:51:36.660 So with the pharmacogenomics, the alert system that you pointed out:

00:51:41.020 how often do you get sort of false alerts, right?

00:51:44.100 Because sometimes you can get an alert that a physician may not be really interested

00:51:52.580 in or think is valid, right?

00:51:54.940 Yeah.

00:51:55.940 So we designed this pretty carefully.

00:51:59.940 We could bump up our alert numbers by just firing the alert whenever someone gets a relevant

00:52:05.580 prescription.

00:52:08.380 That’d be great for our metrics.

00:52:09.620 On the other hand, it’s not terribly useful for care, and people would learn to ignore it.

00:52:13.660 So we’re pretty focused.

00:52:15.060 So most of the alerts are non-interruptive.

00:52:17.260 So the idea is an 80-20 rule.

00:52:19.500 So only 20% of the alerts should be interruptive, 80% should be non-interruptive.

00:52:24.140 And because this isn’t based on a predictive model, it’s pretty straightforward to make

00:52:29.660 sure it fires largely at times when it’s relevant.

00:52:32.220 So we restrict things like which clinics the alert can fire in.
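
As a sketch, the firing logic described here (relevant prescription, restricted clinics, mostly non-interruptive) might look like the following; the clinic names, drug/genotype table, and field names are invented for illustration and are not the actual CCPM rules:

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical scope and risk tables (illustrative only).
RELEVANT_CLINICS = {"cardiology", "oncology"}
HIGH_RISK_PAIRS = {("clopidogrel", "CYP2C19 poor metabolizer")}

@dataclass
class Order:
    drug: str
    clinic: str
    genotype: Optional[str]  # pharmacogenomic result on file, if any

def route_alert(order: Order) -> Optional[str]:
    if order.clinic not in RELEVANT_CLINICS:
        return None                  # outside scope: never fire
    if order.genotype is None:
        return None                  # no PGx result: nothing to say
    if (order.drug, order.genotype) in HIGH_RISK_PAIRS:
        return "interruptive"        # the rare, must-see alerts (~20%)
    return "non-interruptive"        # passive note in the chart (~80%)

print(route_alert(Order("clopidogrel", "cardiology", "CYP2C19 poor metabolizer")))
# -> interruptive
```

The design choice is that interruptive alerts are reserved for the small set of drug-genotype combinations that truly demand attention, so clinicians don't learn to dismiss them.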

00:52:36.180 However, one of the really nice things about UCHealth, and I’ll go back to my advertising

00:52:39.780 pitch for why you should come work at Colorado if this is something

00:52:42.500 you’re interested in, is that UCHealth thought about this in advance.

00:52:46.660 So years ago, they built a virtual health center.

00:52:49.620 And if you want to look it up, you can look up the UCHealth virtual health center.

00:52:53.500 The guy who, to my understanding, put this together and sort of was visionary behind

00:52:58.060 it is a guy named Rich Zane, who’s our chief innovation officer through the hospital.

00:53:03.060 And what they did there is they have nurses and clinicians who work offsite, but who are

00:53:07.780 available to look at these types of systems before they flow to people who are onsite

00:53:12.900 providing care.

00:53:14.260 And where this became really useful: is anyone aware of the Epic sepsis model

00:53:19.260 thing that blew up maybe a year or two ago?

00:53:22.820 No.

00:53:23.820 So Epic is one of the major providers of electronic health record systems.

00:53:26.860 They have a sepsis model, and that sepsis model is pretty noisy.

00:53:31.300 It alerts probably more frequently than it should, and it misses cases it shouldn’t

00:53:36.380 miss.

00:53:37.420 So there was a team at CU before my time, led by Tell Bennett, that had evaluated this

00:53:43.020 model and found that it had some predictive quality, but maybe it wasn’t ready.

00:53:50.180 I want to be careful what I assert here.

00:53:52.460 It had some predictive quality, but deployed in practice at scale could have created a

00:53:57.940 lot of unnecessary burden on providers.

00:54:00.620 Well, what they did, because they have the virtual health center, is they’re able to

00:54:03.620 deploy that model plus others in the virtual health center, have it alert nurses and clinicians

00:54:11.060 there, and then have them look at it carefully in the virtual health center and only send

00:54:16.260 the notice over to the folks who are working at the bedside if it’s actually going to be

00:54:20.100 useful.

00:54:21.100 So a reason that could be good to work at Colorado, if you’re interested in kind of

00:54:23.620 predictive analytics and deploying this stuff in practice, is you can have a model that’s

00:54:27.460 not perfect, right?

00:54:28.620 It doesn’t have to be good enough to hand to a provider at the bedside.

00:54:33.820 Because of the virtual health center, you can really proof of concept it out there,

00:54:37.460 improve it, understand how you can improve its predictive quality, and then deploy it

00:54:40.060 when it’s ready.

00:54:41.060 But you can still get the benefit in the meantime.

00:54:42.660 So I guess I’d say, yeah, I think our noise level is pretty low on these alerts, but we

00:54:49.220 do have a system in place for noisier stuff if people want to deploy it.
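
The virtual-health-center pattern described above, where a noisy model's alerts are screened by offsite staff before anything reaches the bedside, reduces in code terms to a human filter in front of the model. Everything below is a toy stand-in, not UCHealth's actual system:

```python
# Every model alert goes to virtual-center reviewers; only alerts they
# judge actionable are forwarded to bedside providers.
def screen_alerts(model_alerts, judged_actionable):
    return [a for a in model_alerts if judged_actionable(a)]

# Stand-in for human review: dismiss low-probability sepsis alerts.
def toy_reviewer(alert):
    return alert["sepsis_prob"] >= 0.8

alerts = [{"patient": "A", "sepsis_prob": 0.30},
          {"patient": "B", "sepsis_prob": 0.92}]
forwarded = screen_alerts(alerts, toy_reviewer)
print([a["patient"] for a in forwarded])  # -> ['B']
```

This is why an imperfect model can still be deployed: the human layer absorbs the false positives while the model is studied and improved.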

00:54:57.900 It’s fun to be back in Athens.

00:54:58.900 Go Dawgs.

00:54:59.900 I don’t know what else I should say, but I’m just excited to be back.

00:55:04.220 I was really tickled when I got the invite, it was wonderful.