Absci Invites | DeepChem: Towards Open Source Foundations for Modern Drug Discovery
May 26, 2022
“DeepChem: Towards Open Source Foundations for Modern Drug Discovery” presented by Absci Invites: Seminar Series
We recently hosted Bharath Ramsundar for our #AbsciInvites seminar series. Bharath presented his work on DeepChem, an open-source toolchain that seeks to democratize the use of #deeplearning in drug discovery. #unlimit
Disclaimer: Views and content presented by Bharath Ramsundar are his own and should not be attributed to Absci.
Presentation Transcript:
Deniz Kural:
Hi. Hello everyone, and thank you for joining us today. I’m Deniz Kural, SVP of Antibody and Target Discovery here at Absci. And I’m really excited to welcome Bharath Ramsundar, the CEO of Deep Forest Sciences, today. Bharath is going to be discussing the importance of open source foundations in modern computational drug discovery efforts with us today. But before we get started, just a note about questions. We encourage you to ask questions throughout the presentation. If you have a question, please press the raised hand button at the bottom of your screen and we’ll call you in real time on audio. Then a popup will appear asking you to unmute, which you’ll need to click before we can hear you. We’re recording this for distribution on YouTube. So, if you’d prefer to enter your question using the Q&A window, that’ll work as well, and I can just ask them out loud. So with that, I’ll hand the controls over to Bharath.
Bharath Ramsundar:
Awesome. Okay, let me go ahead and dive into it. First, thank you all for joining in today. Today, I'll be speaking to you a bit about DeepChem: towards open source foundations for drug discovery. Some of you may have already heard about DeepChem as a project, but for those of you who haven't: DeepChem is a framework for, very broadly, applying AI to scientific problems. We started life as one of the first libraries out there that provided good tools for applying graph convolutions to molecules. And since then, the community's grown outwards. And I think we now have one of the most sophisticated open source scientific machine learning communities out there. So DeepChem has dozens of different machine learning models. It has, again, dozens of different built-in datasets and scientific featurizations and transformations.
Bharath Ramsundar:
And it has also been used, I think, in a very broad range of scientific papers and contributions. Last count I did, there were several hundred papers at minimum that have used DeepChem to do meaningful things. Actually, let me… So, several organizations have used DeepChem to fruitful purpose. I think every company or academic institution listed here has either written a paper or blog post with DeepChem, or has publicly discussed their usage of DeepChem. So, I think we've had a pretty decent impact in terms of people actually using us to do drug discovery. And, among other things, there are also nearby fields: things like designing better pesticides or herbicides or antibiotics, or for environmental purposes. So there's a broad range of tools and usages that we support through our open source community.
Bharath Ramsundar:
We have a quite active and growing community of users that have been using DeepChem to do things. One of the coolest things, I think, about the project is that we have developers from across the world. I think we have active developers in India, throughout the U.S., throughout Europe, South America… Sorry, I have a scam call coming in there. And Japan, among other places. So, I think we have one of the largest global communities of distributed researchers centered around building open source tools. And part of the magic of open source, I think, is that it enables these collaborations and connections that you wouldn't have made otherwise. So, even if you're at a top-tier institution, a place like Stanford or MIT, you see the people who are on that campus. Who you don't meet is the brilliant kid, maybe out of Peru, who would not have had a chance to come to MIT, but is, by happenstance, an amazing programmer.
Bharath Ramsundar:
So, I think that's the type of connection that DeepChem enables, and that's part of the reason I continue to spend time on open source efforts over the years. So, part of what I'm hoping to do today is, one, provide just a rough introduction to what DeepChem is. And two, I'll say a little bit more about some of the science that we've enabled over the last couple of years and talk about some of the recent pre-prints and manuscripts that have come out of the project. So, this is a very high level, rough introduction to the structure of the DeepChem project. This is taken from our open source research forum; this is a diagram that was put up talking through the structure of DeepChem. So DeepChem, at a high level, is a framework for constructing scientific machine learning workflows, where you essentially take in input data, specify a series of transformations, and then, at the output, get predictions or other structured output that tells you, say, what molecules to try, what sequences to look at, what images to focus on, at the other side.
Bharath Ramsundar:
So DeepChem allows you to pick and choose every one of these components. There are literally dozens of choices for each of the boxes you see here: dozens of models, dozens of featurizers, different ways to split data, transform data. And when you add all these combinatorial choices up, DeepChem supports a very broad range of scientific programs within a simple, but very flexible, framework. And I think part of the simplicity here has been important for getting people onboarded, because unlike many standard scientific toolchains, where you need to deeply understand partial differential equations or statistical thermodynamics in order to be able to use them, here you can understand some basic principles of how to work with these systems and then start using that to do meaningful science. Which we've found is just an amazing way to onboard newcomers to working with DeepChem.
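To make that concrete, here's a minimal sketch of what one of these workflows can look like in code. It uses the built-in Delaney solubility dataset and a graph convolution model that ship with DeepChem; the epoch count and metric are just illustrative choices, not a recommendation from the talk.

```python
import deepchem as dc

# Load a built-in benchmark dataset; DeepChem featurizes the raw SMILES
# into graph objects and splits it into train/valid/test for us.
tasks, datasets, transformers = dc.molnet.load_delaney(featurizer="GraphConv")
train_dataset, valid_dataset, test_dataset = datasets

# Pick one of the dozens of available models: a graph convolutional
# network doing single-task regression on solubility.
model = dc.models.GraphConvModel(n_tasks=len(tasks), mode="regression")
model.fit(train_dataset, nb_epoch=50)

# Evaluate on held-out data with a standard metric.
metric = dc.metrics.Metric(dc.metrics.pearson_r2_score)
print(model.evaluate(test_dataset, [metric], transformers))
```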
Deniz Kural:
So, speaking of newcomers, in your previous slide there seems to be an inflection point happening in early March, 2020 or so. What do you think happened?
Bharath Ramsundar:
Well, there's a basic human thing there… So, if you can see… I started the project when I was doing my PhD and it was growing quite rapidly. But then, in the middle there, I made the mistake of doing a startup that wasn't related to DeepChem, and I continued working on DeepChem a bit, when I could, on nights and weekends. And I think around March 2020, my time at that other company came to an end and I thought, "You know, I'm really passionate about this, I've been spending so long working on it. Let me try to see where it can go." So-
Deniz Kural:
So, it's not the pandemic? It's not because of COVID, you had more people at home being…
Bharath Ramsundar:
That's a good point. I think there's a kind of confounding variable there, of course, with the pandemic. I think that has had a big effect in terms of people coming online. I'll say, the pandemic triggered this amazing wave of interest in computational drug discovery. But, I think one challenge, still, is that… What we saw was that there was this spike of interest that lasted about a month. And then, after that, it died down a little bit, because the reality is that it's hard work to learn these skills. And it's hard to learn enough in a month to have a meaningful impact, but we did pick up a few people. So I think there is a secondary kind of factor there. Yeah, of course, it's probably both those things, maybe more the pandemic, who knows.
Deniz Kural:
Thank you.
Bharath Ramsundar:
So, I think one of the things that we maintain as part of DeepChem is an extensive collection of community-driven tutorials. These all run on Google Colab. I think there are something like 50 tutorials now that talk you through using DeepChem to solve problems in practice. So if you're doing things like… Well, I think we have the deepest support really for small molecules. So, if you're designing small molecules, generating small molecules, looking at synthetic feasibility of small molecules, there's a tutorial in here for you. So there's a very broad range of useful tools and, off the record, I know that several companies have gotten started from people working with some of these tutorials and resources and using that to build a first version of their systems. I think I'll also mention, I'll jump ahead briefly.
Bharath Ramsundar:
We have a book with O'Reilly called Deep Learning for the Life Sciences. This provides a more book-length introduction to working with some of the tools that are available in DeepChem. We are slowly working towards trying to get an updated version of this book out there. There have, of course, been a lot of fundamental advances in machine learning for the life sciences over the last few years; AlphaFold 2, among other things, has really, I think, changed the face of what it means to do machine learning in biology. So we're trying to see if we can update our resources to reflect more of that. But I will mention the tutorials are free, on the internet. The book, we don't set prices, I think it's something like $50. Most of the material in the book is covered through the open tutorials as well. So, if you're feeling thrifty, I definitely recommend checking out the open resources. But if you would like to donate to our coffee fund, we don't make much money from this, but it does pay for my tea shop habit, so please feel free to buy a copy of the book.
Bharath Ramsundar:
And I'll briefly mention a little bit about history. So DeepChem is a project that's been around, I think, seven years at this point, and we've gone through a lot of evolution. We started life… The first version, I think, ran on Theano, which, for those who've been doing machine learning a while, was a very cool framework that's no longer quite around. We've evolved through TensorFlow and, increasingly, in our latest release, we've swapped over to using PyTorch as the main backend. But we also have JAX support as something people are very interested in.
Bharath Ramsundar:
So, I think part of what we've been trying to do in the last year with DeepChem is really production-stabilize it. We spent a lot of work effectively hammering down weird, edge-case bugs. And I can attest DeepChem is quite stable and safe to use in production settings; I've been using it in production settings for some time. So, if you're a big company, like Absci, that would like to actually use DeepChem in a production workflow, I think it's quite stable and ready to use. And I'll say a bit more about this at the end of the talk.
Bharath Ramsundar:
Okay, so that's just a rough, high-level introduction to what DeepChem is and what we do. But I thought it might be fun to talk about some recent science that we've done through DeepChem and through collaborations spurred by the open community around DeepChem. But, before I go into talking about some of the more recent scientific results, I'll just pause there in case anyone has any questions about DeepChem itself and the open community around it.
Bharath Ramsundar:
Going once, going twice. Awesome. So let's talk about some science. So, I think that there are a number of open questions around doing machine learning on molecules that have really pulled in a lot of attention over the last couple of years. One of the big ones, of course, has been using generative methods to design new molecules. This is of a lot of interest, I think, to people working on small molecule drug discovery, among other things. If you could have the computer dream up a molecule, that could potentially be a powerful source of innovation that you or your scientists may not have thought of. Now, there are a lot of challenges to doing something like this, and I'll say a bit more about them, but this was a project that grew out of just some open source hacking.
Bharath Ramsundar:
One of our contributors, Nathan, was finishing up his PhD and had gotten excited about working on small molecule generative methods. But, you know, it wasn't what his thesis was focused on, so he ended up working with us a little bit over the summer and we ended up writing a paper together, actually, about some of the results that we achieved. So, the core idea behind some of this early hacking that Nathan started was that there were a couple of new innovations out of both the machine learning and computational chemistry ecosystems that we were excited about. The first is normalizing flows, which I think are a tool that has really been popularized by DeepMind, among others. It's a way of transforming a simple probability distribution, like the little Gaussian you see on the left hand side, into a very complex probability distribution, like the complex distribution you see on the right hand side. And a normalizing flow is a type of invertible transformation that satisfies certain mathematical conditions.
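For anyone who wants that condition spelled out, the standard change-of-variables identity that a normalizing flow relies on (textbook material, not something specific to this paper) is, for an invertible map x = f(z) with base density p_Z:

```latex
p_X(x) = p_Z\left(f^{-1}(x)\right)\,\left|\det \frac{\partial f^{-1}(x)}{\partial x}\right|
```

Because the right-hand side is already a properly normalized density, the flow gives exact probabilities without ever estimating a separate normalizing constant, which is what the next statement about the partition coefficient is getting at.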
Bharath Ramsundar:
For the mathematicians in the audience, it effectively ensures that the partition coefficient is one, which means that you don't have to mess around with some painful challenges around estimating partition coefficients. And this means that you can compute probabilities very nicely. The other advance, on the computational chemistry side, was some really exciting work, I think by Mario Krenn out of Alán Aspuru-Guzik's group, coming up with a new representation format for molecules as character strings, called SELFIES. So one of the challenges in generating molecules is that the language typically used to represent molecular structures is called SMILES. And SMILES has the bad habit that you can write a syntactically valid SMILES string that is chemically meaningless. Like, you can have too many bonds, you can have valences that are broken.
Bharath Ramsundar:
So SELFIES took the step of trying to tie the grammar of the string closely to the structure of the molecule. So, in theory, if something is a syntactically valid SELFIES string, it should be a molecule that can exist, at least in principle. And I think the basic idea behind this paper was thinking, "well, what if we just glued these two methods together?" By using SELFIES, we can ensure that whenever we sample a valid string, we come out with a valid molecule. And then we can use normalizing flows as a way of learning the very complex distribution of molecular space that is occupied by real-world molecules, and use that to pull out complex molecular samples. And here's a brief, top-line table. This very simple idea actually turns out to work quite well.
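As a small illustration of the SELFIES idea, here's a sketch using the open source `selfies` Python package from Krenn and collaborators (a separate library, not part of DeepChem), which exposes an encoder and decoder between SMILES and SELFIES:

```python
import selfies as sf

# Round-trip a SMILES string through SELFIES.
smiles = "c1ccccc1O"                 # phenol, written as SMILES
selfies_str = sf.encoder(smiles)     # a string of SELFIES symbols like "[C][=C]..."
roundtrip = sf.decoder(selfies_str)  # back to a SMILES string
print(selfies_str)
print(roundtrip)

# The key property for generative models: any string assembled from
# SELFIES symbols decodes to a syntactically and valence-valid molecule,
# so a sampler that emits SELFIES can't produce chemically broken output.
```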
Bharath Ramsundar:
So there are sophisticated methods out there for sampling molecular structures. A lot of the way these methods work is they assemble the molecule atom by atom. So you can view the molecule as a graph, like nodes and edges, and you stick on a new atom, one at a time, into a place on the graph. And this tends to work well, but it tends to be quite complex to code up and maintain. Whereas our method takes the approach of just saying, "no, just take two methods that are already well known elsewhere, glue them together, and bam, you have something that's almost as good as the state-of-the-art techniques in terms of sampling valid, unique molecules."
Bharath Ramsundar:
And I think one of the things that we were able to do, because of the simplicity of this method, was to swap it into a simple synthetic active learning pipeline. So the idea behind an active learning pipeline is that you can have an automated scientist, where you're able to propose new experiments, in this case, new molecules to take a look at. Then you can use some downstream method of evaluating the goodness of these molecules. For the simple synthetic task we looked at, we just used a simple proxy metric, but in reality this might be an experimental or assay evaluation or something else like that. And then you can take the results of this and feed them back into the original sampler to generate your next set of hypotheses. So, at a very high level, this type of active learning loop is like the same process we as human scientists follow when we're trying to solve a problem.
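Here's a rough, runnable sketch of that loop. Everything in it is a toy stand-in: the "sampler" draws from a fixed pool rather than a trained flow, the proxy score uses RDKit's QED rather than whatever metric the paper used, and the model update is a placeholder, purely to show the propose/evaluate/feed-back shape of the loop.

```python
import random
from rdkit import Chem
from rdkit.Chem import QED

# Toy candidate pool standing in for a SELFIES normalizing-flow sampler.
POOL = ["CCO", "c1ccccc1O", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC", "C1CCNCC1"]

def sample_molecules(n):
    """Propose candidate molecules (would be the generative model in practice)."""
    return random.sample(POOL, k=min(n, len(POOL)))

def score_molecule(smiles):
    """Proxy 'goodness' metric; a real campaign would use an assay or experiment."""
    return QED.qed(Chem.MolFromSmiles(smiles))

def update_sampler(scored):
    """Placeholder: re-fit the generative model on the newly scored molecules."""
    pass

best = None
for round_idx in range(3):  # three iterations, echoing the toy benchmark in the talk
    scored = [(smi, score_molecule(smi)) for smi in sample_molecules(4)]
    update_sampler(scored)
    top = max(scored, key=lambda pair: pair[1])
    if best is None or top[1] > best[1]:
        best = top
print("best molecule so far:", best)
```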
Bharath Ramsundar:
But it's trying to automate this into an iterative loop structure that, in theory, can be faster and more efficient than a human. In practice, I always keep a human in the loop when I'm doing any real science; I don't trust any algorithm enough to not actually have humans looking at it. But I think it can actually be a systematic way of coming up with ideas that maybe you might not have thought of. We all have blind spots. And I think the algorithm has blind spots too, but it has different blind spots than a human. So I think it's a very powerful system to have in the room when you're doing scientific discovery challenges. And here are just some brief results from the paper. The money shot, the core result, is probably in panel D, where we show that, in a simple synthetic challenge, this active learning loop is able to find the maximally interesting molecule on a small toy problem in three iterations, whereas a simple random search takes, I think, something like 35 steps of experimental iterations to find the maximally interesting molecule.
Bharath Ramsundar:
So you can maybe see, in principle, where this goes. For you all, maybe you're trying to design an antibody, and instead of 35 experimental iterations, if you're able to do the same in three, I think things start to look very interesting. I'll say also, of course, this is a toy, synthetic baseline. We've been separately looking, since then, at ways of fleshing out some of the science here. And I'll also give a shout-out to Connor Coley's group, which Nathan ended up partly working with after his time working with us on DeepChem. I think they've been doing some amazing work on extending active learning into drug discovery and building out frameworks for making that more reliable.
Bharath Ramsundar:
So I think, and I'll have a citation at the end where I actually point you to the paper that we wrote up about this, which appeared in, I think, the NeurIPS Machine Learning for Molecules workshop, 2021, or 2020 I would like to say. So, before I move on to the next paper here, I'll just pause in case anyone has any questions about that last work.
Bharath Ramsundar:
Awesome. Going once. Great. So, continuing the theme of molecular generation: another project that we ended up working on, a collaboration with researchers at CMU and Julia Computing, was looking at ways of generating new molecular structures using a new class of methods called score-based generative models. So here's another diagram like the one I showed earlier, and I'll try to explain the difference between score-based methods and normalizing flows. So, you might remember, just a minute ago, how I said that in normalizing flows you use a mathematical trick to ensure that the partition coefficient, which is effectively… For the non-mathematicians in the room, if you have a probability distribution, all the probabilities have to add up to one. Otherwise, saying something has a probability of three makes no sense. It can be 0.25, that makes sense, but three, that's not sensible.
Bharath Ramsundar:
So, if you have a complex function that describes the probability of something in the real world, oftentimes you might have this awkward factor where, when you add up all the possibilities, it comes out to five instead of one. And you might say, "well, let me just divide everything through by five and call it a day." And that five is what's typically called Z, the partition coefficient. So, it turns out that estimating this value Z is extraordinarily complex in practice. It's been a while since complexity theory class, but I would like to say it's something like #P-complete, or something horrendous like that. So normalizing flows use a clever mathematical trick to effectively ensure that you always maintain Z = 1. But this means that you have to constrain the type of transformations you can perform.
Bharath Ramsundar:
And, in practice, what that means is you need very deep networks before you can actually sample very complex distributions. Whereas with score-based models, you effectively apply an alternate trick where you say, "well, instead of me trying to learn the probability distribution outright, what if I just try to match the gradient?" That is, instead, let me effectively take the derivative of the log probability. And this turns out, if you do a little bit of math: if you picture a function divided through by Z and you take the log of it, then you're subtracting log of Z instead. Then you take the derivative and that constant term vanishes. So, I apologize for doing math in words, without writing anything down, but the high-level idea is: score matching means that you can use more complex functions, rather than the constrained functions of normalizing flows, to model a sophisticated probability distribution.
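Since that math was done in words there, here is roughly the identity being described, writing the unnormalized model as \tilde{p} and the normalizing constant (the "partition coefficient") as Z:

```latex
\nabla_x \log p(x)
  = \nabla_x \log \frac{\tilde{p}(x)}{Z}
  = \nabla_x \log \tilde{p}(x) - \nabla_x \log Z
  = \nabla_x \log \tilde{p}(x)
```

since Z does not depend on x. A score-based model learns this gradient (the score) directly, so Z never has to be computed.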
Bharath Ramsundar:
And it sidesteps the difficulty of having to estimate this partition coefficient. So we were excited to see whether you could adapt these score-based methods to generating molecules. And we also brought in our friend SELFIES, which is just a really cool advance. So we were like, "okay, let's just do it on SELFIES, so we don't have to worry about some of these syntactic issues." So when we started this project, as far as we were aware, score-based methods had only been applied primarily to generating images, and not even that much to generating text. But we were curious to see whether we could apply them to generating molecules. And, effectively, what it turned out was that we were able to get the system to generate valid molecules, but there were some interesting trade-offs, where we effectively found that there were still some limitations that prevented the molecules generated from being entirely useful.
Bharath Ramsundar:
Let me just pull up a little bit of a picture here. So, we were able to use these score-based methods to sample diverse and interesting-looking molecules. But one of the things we found is that the variance of these methods was very broad. So, effectively, you couldn't tightly constrain the molecules sampled to a region of space that was close to the ones you cared about. For example, you might, say, have a few sample molecules that are active against a particular target, and you actually want to stay close to that region of chemical space. So you can look around it, but not, say, look the next town over; instead, you want to look the next block over, as it were. And this is actually something that these score-based methods still struggle with. And I think this is an issue that other people in the literature have wrestled with, where you have this sampling challenge, and controlling how these methods learn is still an open question in machine learning.
Bharath Ramsundar:
So I think part of what was exciting about this project is that we were able to get it to go somewhere interesting, but at the end, when we wrote up this paper for the workshop, we did not have state-of-the-art results. Which I think is also one of the cool things, where sometimes you have science that's still in flight. So we have continued working on this project, and I think we have better results, but we don't yet have something that is quite state-of-the-art. It's kind of raw science. But, again, I think that's the fun part of a global effort like this, where, again, I actually have not met any of my co-authors on this paper. Actually, on any of the papers I mentioned, I've never met them in person. But it's all mediated through this decentralized, distributed community that is DeepChem. So I think, with that, I'll swap over to discussing a pair of papers that I'm personally very excited about. But before I move on, I'll pause there: any other questions about generative methods, either normalizing flows or score-based methods?
Bharath Ramsundar:
Awesome, okay. Now let's learn about some large-scale pre-training. So, one challenge that I've seen working in machine learning on molecules, or really other scientific applications, is that there's never that much data. And I'm sure you all have faced similar challenges where, if you're looking at scientific data, generating real datasets is costly. You need to run real experiments, which means real reagents, real materials. It's not like the world in tech, where annotating images is something that can be done at very rapid, bulk scale, or you can fool your users into clicking your CAPTCHA images and doing your annotation for you. So this means that, in doing scientific machine learning, we're always in a low-data regime where there's always going to be less data than we would like for building out systems.
Bharath Ramsundar:
So, one of the interesting things that other folks and we started noting a few years ago is that there are much larger unlabeled datasets, where you, say, have raw chemical structures, or raw protein sequences or genomic sequences, than there are datasets with actual, experimentally measured quantities attached to them. So if you could somehow leverage unsupervised or self-supervised learning to learn from this raw data without experimental labels, and learn something about the structure of small molecules or the structure of protein or antibody sequences, there are some potentially really interesting things you can do here. So, for us, we were interested particularly in small molecules, so we started looking at applying basic transformers to work directly off SMILES representations of small molecules.
Bharath Ramsundar:
I'll briefly mention we did try SELFIES, but in this particular use case it doesn't seem to make a big difference. The transformer seems to learn fine either way, and the SMILES are actually faster to run at scale. So there's a little bit of a difference with the two previous works I mentioned. And the basic idea here is: what if you just train a standard transformer model, using the same methods that people use in natural language processing, on these small molecule datasets? And the core idea for how we train a transformer model on these datasets is we do what's called masked language modeling. Effectively, what you do in a masked language model is, if you have, say, a sentence, you might mask out certain words in the sentence and then train the model to infill the missing words.
Bharath Ramsundar:
For a molecule, what this means is you effectively mask out certain atoms, or bonds, in the molecule, and you ask the algorithm to learn how to predict the missing atom or bond. Imagine if you have a molecule and you pull out a carbon; you're like, "okay, I have a blank that has four bonds coming out of it. What could it be?" There's basically only one answer if you're looking at organic molecules. So learning to solve these types of challenges, I think, teaches these models some basic understanding of chemistry. So, what we did effectively in this first paper, from a couple of years ago, was we trained, I believe, on a set of 10,000,000 molecules using this masked language modeling objective. Then we took the resulting model and we fine-tuned it on some downstream benchmark datasets from the MoleculeNet benchmark suite.
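To make the masked language modeling setup concrete, here is a minimal sketch using the Hugging Face `transformers` masking collator on SMILES strings. The checkpoint name is a placeholder, not the model from the paper; any masked-LM checkpoint whose tokenizer handles SMILES would slot in the same way.

```python
from transformers import (AutoTokenizer, AutoModelForMaskedLM,
                          DataCollatorForLanguageModeling)

# Placeholder: substitute a real SMILES-pretrained masked-LM checkpoint here.
checkpoint = "your-smiles-mlm-checkpoint"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForMaskedLM.from_pretrained(checkpoint)

# The collator randomly masks ~15% of tokens: the "blank out an atom or
# bond and ask the model to infill it" objective described above.
collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

smiles = ["CC(=O)Oc1ccccc1C(=O)O", "CCO", "c1ccccc1N"]
batch = collator([tokenizer(s) for s in smiles])  # masked input_ids plus labels

outputs = model(**batch)
print("masked-LM loss:", outputs.loss.item())
```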
Bharath Ramsundar:
And what we found was a little bit underwhelming: a basic random forest pretty consistently outperformed this fine-tuned transformer. So, on the face of it, these results were like, "well, why are you bothering to train an expensive transformer when a simple random forest can do better?" But we did note a very interesting, early scaling law. We did ablation studies where we trained on subsets of the data, and the pattern we picked up was that, as we trained on larger collections of molecules, starting in this case, I think, from 100,000 up to 10,000,000, we saw steady increases in downstream performance on tasks. And then, naturally, we said, "well, what if we continue drawing the line further out?" Can this reach a point where it actually outperforms standard methods for predicting the properties of molecules? And, for those of you who have followed advancements elsewhere in technology, you'll know where this is going: with GPT-3 or the other things OpenAI is doing, if you get one of these transformers big enough, they start doing very magical things.
Bharath Ramsundar:
So, with this study, I think we ended up at the 10,000,000 mark. With the student who was leading it, we did a lot of this on Google Colab, so we left it there. But we did also take a little bit of a look at things like attention, where we were able to pull out estimates, using the transformer, of which regions of a molecule were contributing the most to downstream predictions of its properties. So I think the nice things about this paper were that we learned a little bit about the scaling law, and we were also able to pull out some ways of analyzing what portions of molecules contributed meaningfully to downstream performance. And I think, with that, we wrapped that project up. But we did actually extend this work out into a sequel that we called ChemBERTa-2.
Bharath Ramsundar:
So this project picked up right where we left off with the previous one, where we started asking, "well, can we go one step larger with how we train these things?" And we also experimented, I think, with another self-supervised task. So, previously, I mentioned this masked language modeling task where you block out certain tokens and have the model infill them. But another task that we started using in this version is something called multitask regression. And this has been proposed a few places before in the literature. The idea is that, for a molecule, you already know a bag of basic properties that can be derived directly. You might know its molecular weight. You might know the number of rotatable bonds that it has. And these can be computed very simply with basic algorithms. So what if you trained the transformer model to learn to predict the molecular weight, given the string? This actually turns out to be a useful task.
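As an illustration of where those regression labels can come from, here's a small sketch computing a few cheap descriptors with RDKit. This isn't the paper's exact property list, just the two descriptors mentioned plus one more for flavor.

```python
from rdkit import Chem
from rdkit.Chem import Descriptors

def property_labels(smiles):
    """Cheap, computed-on-the-fly regression targets for a SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return {
        "mol_weight": Descriptors.MolWt(mol),                 # molecular weight
        "rotatable_bonds": Descriptors.NumRotatableBonds(mol),
        "logp": Descriptors.MolLogP(mol),                      # an extra cheap descriptor
    }

# The transformer is then pre-trained to predict this vector of values
# from the raw string, alongside the masked language modeling objective.
print(property_labels("CC(=O)Oc1ccccc1C(=O)O"))
```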
Bharath Ramsundar:
It teaches the model some basic knowledge about what the components of the molecule are and what contributes what amount of mass. And, if you train the transformer this way, this actually turns out to be quite a useful self-supervised objective. And this is a very rough diagram of the architecture that we built out. The idea here was that we used the masked language model pre-training and also the multitask regression pre-training, and then we fine-tuned these on downstream tasks. One trick we found… I'm not sure if I actually have a slide for this, but I'll just mention it. One trick we found was that training the multitask regression models was quite expensive. I think it was something like a week or two on an A100 to get a multitask regression model trained.
Bharath Ramsundar:
So we couldn't really do much hyperparameter tuning on the multitask regression. But we did a correlation study where we found that hyperparameters that worked well for the masked language modeling challenge also tended to work well for the multitask regression. So we were able to use the masked language modeling as a proxy task for selecting hyperparameters, which we then used on the multitask regression. And there were a couple of other tricks like this that we ended up using, which I think we discussed in the paper. But training large transformer models is definitely a work of art these days. It's not something that really works out of the box; it's more that you have to learn the magic knobs to make these things actually behave. But when we got all the magic knobs behaving, one thing we found was that when we scaled out to larger collections of molecules and added in this new self-supervised task, we were able to get better results. We were able to now see that the ChemBERTa-2 models are, on average, I would say, able to outperform baselines on downstream tasks.
Bharath Ramsundar:
So you do see the scaling law playing out, but there are a few interesting points there. If you notice, sometimes it's not the biggest model that actually does the best. So sometimes a model that was trained on 10,000,000 molecules, or even 5,000,000, tends to outperform the model that was trained on the whole 77,000,000 that we considered in this paper. So there are still a lot of mysteries around these models. Like, in the last paper, we showed that there was an interesting scaling law, but this type of transformer-based pre-training didn't really beat downstream methods. And now I think we have a way to achieve numerical results that are potentially more predictive than other methods out there, like Chemprop or other standard baselines, but we find that there's still an instability in the training.
Bharath Ramsundar:
I think, not to spoil anything, but we have not solved this yet. There is still a lot of work being done; this is an active area of research for us. So ChemBERTa-3 is slowly in the works and we're trying to get this wrapped up. But if I had to say, this is an area where I think these large models, what we call chemical foundation models, are going to be fundamentally important to the science of these fields. And I think we have some more cool results, but nothing that's quite ready to share yet. But we'll have more results coming out of DeepChem and other collaborator groups, I think, over the coming months. So, before I transition to the next part of the talk, let me just pause there in case anyone has questions about chemical transformers.
Deniz Kural:
I guess I had a question, a more general question. So thank you for presenting a couple of different frameworks and papers. And maybe your next section already discusses this, so if it does, feel free to just move on. How does DeepChem, as an open source software framework, make it easier to do the previous three projects that you've outlined, as opposed to just using bare-bones PyTorch and, let's say, the standard ML libraries for serializing and storing the interim neural nets and doing checkpoints and so on?
Bharath Ramsundar:
That is a great question. So I'd say there are two parts to this, in that there's DeepChem, the library, but there's also DeepChem, the open source community of researchers. So I think all the collaborations that were listed above were done by people, I think, none of whom had ever met each other in person, and all of them were catalyzed by people working in the broader framework of DeepChem. What I will say is that the first project has been fully integrated into DeepChem, so it's now within DeepChem 2.6. The latter two projects are still active research, so they're actually not production-released into DeepChem itself. So it's more that, as we make advances on these projects, we then package them into the DeepChem releases and release them more broadly to the community.
Bharath Ramsundar:
But it's not the case that… I'd say we view DeepChem as a vehicle for taking some of these advances, spreading them out, and making it easy for other people to use them. But one limitation we have is that DeepChem, as it stands today, is a powerful tool for packaging a model and making it easy to use once it's kind of baked. It's not a lower-level library like PyTorch or JAX, which focus on making it easy to experiment with different models. So we do a lot of raw hacking in PyTorch and JAX ourselves and then wrap that up into DeepChem in a nice, shrink-wrapped unit that makes it easy for people to use. So I think the goal of DeepChem here is to make it easy for people to use these advances as they come out. But DeepChem, as a framework, is something we're still trying to evolve into a tool for scientific experimentation, as opposed to just a tool for using some of these methods out there. I don't know if that fully answers-
Deniz Kural:
Yeah, that does. Thank you.
Bharath Ramsundar:
I think this maybe actually ties a little bit into the future of DeepChem. So I think that, broadly, one of the big problems we've been trying to solve with DeepChem is: how do we make this framework more composable, and how do we make it more extensible? And this has been a major engineering challenge. So I think I alluded to this briefly, but the Python ecosystem has changed its machine learning library du jour, like, at least four times, each of which has resulted in having to rewrite a major portion of DeepChem. Now, this is a major challenge. Developing and maintaining a large, open source library in Python is not easy. There's a reason that, if you look at React, TensorFlow, or PyTorch, all of these are backed by a megacorp like Google or Facebook or some other bigger organization.
Bharath Ramsundar:
I think we're actually one of the larger, more indie open source libraries out there. NumPy is also a notable example; they've managed to get some broader industry funding for NumPy and Pandas, which is great. But DeepChem is a smaller project. I think Deep Forest and a couple of other companies have contributed, but we're still trying to figure out how we do the industrial-scale engineering required to make this happen. And we have some cool ideas. If you're curious, join the discussion on our forums: if you go to forum.deepchem.io, there are several research posts about ways we can evolve the future of DeepChem. And one of the ideas that has come out of discussions on the forums and among our community is that DeepChem, as a framework, has been broadly evolving beyond chemistry.
Bharath Ramsundar:
So we had, last year, a really smart summer student who actually worked on solving partial differential equations within DeepChem. He added support for using what's called a physics-informed neural net to solve some classes of PDEs using DeepChem. And for those of you who are applied scientists, I think neural PDE solving is just an amazing field that's blowing up. Solving high-dimensional partial differential equations has been very, very hard, I think, since the dawn of computing. It's still very, very hard, but I think there are now new tools that are making it easier to solve complex partial differential equations than ever before. And I think we'll start to see these tools trickling out more broadly to the community over the coming years. We have some early tutorials and tools in DeepChem and plans for improvement over the next several releases, but PDEs are really hard.
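For a sense of what a physics-informed neural net actually does, here's a generic toy in plain PyTorch (not the student's DeepChem implementation): it fits a small network to the 1-D ODE u' = u with u(0) = 1 by putting the equation's residual directly into the loss.

```python
import torch
import torch.nn as nn

# Small MLP approximating the unknown solution u(x).
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(),
                    nn.Linear(32, 32), nn.Tanh(),
                    nn.Linear(32, 1))
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

for step in range(2000):
    x = torch.rand(64, 1, requires_grad=True)   # collocation points in [0, 1]
    u = net(x)
    # du/dx via autograd: this is the "physics" term of the loss.
    du_dx = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    residual = du_dx - u                        # enforce u' = u
    boundary = net(torch.zeros(1, 1)) - 1.0     # enforce u(0) = 1
    loss = (residual ** 2).mean() + (boundary ** 2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# The trained net should roughly approximate u(x) = exp(x) on [0, 1].
print(net(torch.tensor([[1.0]])))               # roughly e, about 2.72
```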
Bharath Ramsundar:
Another area where there's been, I think, big growth in DeepChem is people starting to use it for materials science applications. So we have a small sub-community of people doing materials discovery with DeepChem, and there are a few tutorials and resources out there. And I think we're starting to work on an early draft of a manuscript introducing the use of DeepChem for materials science. And I'll just mention, very briefly, the community has broadly been building out tools towards broader science. So the new code name for all of this is the deep science suite: a suite of tools that are broadly applicable throughout the sciences. But again, I think a lot of our core efforts really revolve around small molecules, and proteins and genomic sequences are always close to heart.
Bharath Ramsundar:
So, very briefly, I'll come back to this guy in just a sec, but I'll give a brief shout-out. So the company that I run, called Deep Forest Sciences, helps people, among other things, use tools like DeepChem within industry. We built this, I think, very cool framework called Chiron that is powered, at its core, by DeepChem infrastructure, and that enables us to use some of the capabilities of the cloud, but also, I think, some really exciting new algorithms and structures we've built up. So we've been partnering with companies, mostly in biotech, plus a couple of people doing things in energy and other related fields, to do some really cool things here. So I think that part of our goal with the Chiron system is to make it easier than ever for people trying to do scientific discovery to leverage some of these cutting-edge tools.
Bharath Ramsundar:
And I think part of what I find really exciting about the spectrum between DeepChem and Deep Forest Sciences is that we're able to cover the range from the entirely open and purely community-driven to something that's actually enterprise-grade and can be used in a large drug discovery setting, and to be able to span that spectrum and pick the right point on it. There are some customers that are not comfortable having anything be out in the open, and we're like, "yeah, it's fine." That's why we exist as Deep Forest Sciences. Then there are others, for example, we have some collaborators at national labs where it's quite hard to get through the governmental bureaucracy. But we're like, "Oh, it's open source." And they're like, "Great, we'll download it off GitHub and give it a try."
Bharath Ramsundar:
So that gives us a range of flexibility that is part of what makes us interesting. And just quickly to go back here, this is maybe a partial answer to your question, Deniz, about how this becomes more composable. This is a quite difficult software challenge. If you look at TensorFlow, TensorFlow 1 was rigid, but it did what it did really well. And then they tried to make TensorFlow 2, which… Well, this is going on YouTube so I'll be careful about what I say, but I will say they faced the fundamental challenge of trying to change out the engine on a car that worked well. And the new car, it still runs, but everyone kind of moved to PyTorch. And I think that's because it's very hard to take a system that actually works well and make it broader and more capable.
Bharath Ramsundar:
So this is something we've been wrestling with in DeepChem, which is that we would like to enable… As you said, make it easier for people to do scientific discovery through DeepChem, rather than just making DeepChem a vehicle for taking new algorithms out there. And we have several iterations of prototypes; I think we're starting to get better at this. But this is a very hard software engineering challenge, and we're trying to avoid the pitfalls that some large projects have run into here in the past. And one of the things we commit to is that we don't want to break anyone's code. So if you depend on DeepChem, DeepChem will be production-stable; we have a very long deprecation cycle. So we're trying to move in a measured fashion, but I think we are making steady progress. And I think, in a few years, this will be something that is a tool for discovery and innovation, in addition to being a tool for taking discoveries out there.
Bharath Ramsundar:
But yeah, I think with that, I've jumped around a little bit, going a bit out of order today, but thank you for inviting me. That's about all the material I had prepared, but if anyone has questions or things they'd like to discuss, I'm happy to.
Deniz Kural:
Yeah. Thank you so much for the wonderful presentation. And it’s actually right on time. We have some more time for questions. So just as a reminder, if you have a question, you can either press the raised hand button at the bottom of your screen and then we’ll call on you, real time with audio. Or, if you prefer, you could just type any questions you have in the Q&A window and I’ll just read them out for you. And if you think of questions after today’s presentation, you can also always reach out to Bharath or myself directly. With that, I’ll just give people 30 seconds to see if we have any questions.
Bharath Ramsundar:
So, while we're waiting, I'll also mention that we have been trying to get better support for antibodies into DeepChem. I was hoping we could get a Google Summer of Code student for this this summer, but we didn't get any. We will have a number of excellent students working with us, but no antibody student for the summer. So if you know any smart students who'd like to help us improve our antibody support... And we do proteins pretty well, actually, but in general… I'll just put a shout-out there in case anyone wants, not to get anyone in trouble, but an evening fun project that is all open source. Hell, we could use your help.
Deniz Kural:
So it looks like we don't have any follow-on questions. Once again, thank you to our speaker, Bharath Ramsundar, for presenting to us today. And thanks to everyone who joined. I hope everybody has a great rest of their day, and keep an eye out for future editions of our Absci Invites seminar series.
Bharath Ramsundar:
Awesome. And thank you again for inviting me, Deniz, and thank you everyone for listening along and feel free to reach out to me and Deniz, as you mentioned, anytime if you have questions.
Deniz Kural:
Thank you, Bharath, bye.