I’ve always been fascinated with the concept of discovery in biology. Perhaps this is the reason why I work with microscopy images as my data of choice. The whole reason we have the word “cell” is because scientists placed cork and plants under microscopes in the 17th century and observed compartments resembling the cells of a monastery. The existence of bacteria was also discovered unexpectedly in that century, by a cloth merchant who improved the resolution of microscopes to examine the structure of fabric: these findings would eventually lead to the development of germ theory. Even in the modern era, these principles of discovery through data exploration persist. Fluorescence microscopy lets us observe that some proteins form structures around DNA during DNA damage, and lets us systematically identify proteins involved in the DNA damage response through this observation. Time-lapse microscopy lets us observe dynamics over time, leading to the discovery of fascinating oscillatory pulsing behaviors where proteins shuttle rapidly between compartments. Clearly, the ability to take a biological sample and look at it directly, without prior knowledge of what might be present, is a powerful tool for discovery. It means that we can sometimes observe unexpected phenomena that, as we work to understand them, deepen our knowledge of cell biology.
One of the things I think deeply about is how we can sustain this spirit of discovery in the era of big data. Now that high-throughput methods generate thousands of images daily, how can we possibly look at each image individually for exploratory insights? We need new tools that can cope with this volume of data. For the past few years, I’ve been exploring deep learning for this purpose: if deep learning models can summarize image data in a smart enough way, we may be able to rely on the model’s impression of the data to identify interesting patterns or behaviors that merit follow-up.
Initially, at the start of my PhD, I was not interested in deep learning. At the time, the most popular application of deep learning was building classifiers. Based upon that limited impression, I could not see how deep learning would be useful for the questions of data exploration and discovery that I was interested in. In classification, models learn from annotated data provided by an expert. But this means that we could only expose models to examples where we were confident in our ability to annotate the biology. Moreover, the output of a classification model is a label selected from a series of predefined classes provided by the expert. But this assumes that we know about and can define all classes for the biology we are interested in studying from the very beginning. In other words, while classification models can automate the identification of images that we already know a lot about, they are not suited to categorize images and biology we know less about. I expected that the latter kind of images would be where the more interesting, novel insights would be concentrated. Even in the best attempts to manually annotate protein subcellular localization in the yeast proteome from images, nearly 10% of the images were too complex or unknown for experts to categorize.
The major breakthrough that changed my mind about deep learning was the idea of “self-supervised” learning. In contrast to supervised classification models, which learn by rote dictation from an expert, self-supervised models teach themselves about data through puzzle-solving. One advantage of self-supervised learning is that it makes models easier to train, because the puzzle-solving tasks are designed to be posed autonomously. For example, one popular task is to determine by how much an image has been rotated. A preprocessing module takes images of the natural world and randomly rotates them, and the model must then determine the angle of rotation given the rotated image. The idea is that the model learns about objects in the natural world, because to identify the rotation correctly, the model needs to develop an understanding of different objects and how they are usually oriented (e.g. the model needs to know what a cat is, and that its ears usually sit on top of its head, in order to figure out whether an image of a cat has been flipped upside down).
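To make this concrete, here is a minimal sketch of how the rotation task manufactures its own labels. The function name and toy image are my own inventions, not from any particular library:

```python
import numpy as np

def make_rotation_example(image, rng):
    """Randomly rotate an image by 0/90/180/270 degrees and return
    the rotated image together with the rotation index (0-3).
    The label comes for free -- no human annotation is needed."""
    k = int(rng.integers(0, 4))   # number of 90-degree turns, chosen at random
    rotated = np.rot90(image, k)  # rotate counter-clockwise k times
    return rotated, k

# toy usage: an asymmetric 4x4 "image" stands in for a micrograph
rng = np.random.default_rng(0)
image = np.arange(16).reshape(4, 4)
rotated, label = make_rotation_example(image, rng)
# a model would now be trained to predict `label` from `rotated` alone
```

No expert ever touches the data: the preprocessing step manufactures both the puzzle and its answer.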
But there was a further advantage of self-supervised learning that I was even more interested in. Unlike defining classes for supervised classification, we provide the model with no prior biological knowledge or assumptions. Because the puzzle-solving task is intended to be general, we can show the model all the data, including the complicated images that we have no idea how to categorize. With these principles in mind, the self-supervised model can be thought of as developing its own, independent impression of the data. Armed with its puzzle-solving task, the model autonomously explores and plays with the data to develop its understanding. We found that, equipped with the right puzzle-solving task, a self-supervised deep learning model develops an understanding of protein biology in fluorescence microscopy images that is highly consistent with how expert biologists categorize the images. But we also found cases where the model differed from the experts: often, where an expert grouped images into a single class, the self-supervised model would split the same images into multiple distinct subclasses. When we took a closer look at these subclasses, we found that there were often functional differences between them, suggesting that the model might have found novel subclasses that experts missed.
To me, the unbiased nature of self-supervised learning was a big deal. This project illustrated to me how deep learning could be deployed for discovery. By using self-supervised learning, deep learning models form independent perspectives of the data. We can then contrast these perspectives with our current body of knowledge, to find areas where the model disagrees. Here, the model can provide targeted challenges to our prior conceptions of biology, potentially leading to a more nuanced or correct understanding of biology as we determine whether these challenges are valid.
In other words, self-supervised learning had the potential to fuel discovery in big biological datasets. I was excited about this and thought this would be something that computer scientists working in the area would have already made a big deal about. But when I looked at how research in self-supervised learning was developing, very few papers even acknowledged how the lack of bias in self-supervised learning was useful. Rather, every single paper published in the mainstream machine learning conferences focused solely on evaluating and promoting self-supervised learning as a way to obtain high-performing models without labeled training data.
Upon reflection, this focus was not surprising. The assumption that our datasets will contain phenomena or classes beyond our current scope of understanding is one more specific to biology. In the image datasets I work with, 10% or more of the data is so complex and poorly understood that we cannot reliably annotate it. In contrast, the natural image datasets used in mainstream computer vision do not have this problem. While there is some ambiguity around semantics, like whether to call a building a church or a cathedral, for the most part cats are cats and dogs are dogs. No wonder computer scientists did not appreciate how essential the lack of bias in self-supervised learning was to computational biology applications: their datasets were not made for discovery in the same way that ours are.
Overall, this experience has taught me that sometimes, the priorities we have in computational biology will not be major priorities in mainstream computer science. Naturally, this means that research in computer science may develop in directions different from what would best serve computational biology. In my specific example of self-supervised learning, dozens of different puzzle-solving tasks have been proposed, ranging from colorizing black-and-white images, to predicting rotations, to more recently, contrastive methods that predict one part of an image from another. But because computer science primarily pursues optimizing overall performance on object classification, many of these methods have quickly been made obsolete without, in my opinion, their properties being fully understood. Intuitively, different tasks seem to demand that models pay attention to different cues, and I wonder how this relates to the kinds of biological phenotypes these models will be more sensitive to. For example, colorization requires the model to pay close attention to edges, like coloring inside the lines of a coloring book: does this mean that these models are particularly good for understanding proteins at the cell wall, or membrane proteins? Predicting rotations requires spatial attention: would these models be best suited to understanding cell polarity, or polarity in embryogenesis? Unfortunately, these kinds of questions are unlikely to be addressed by the current work emerging from mainstream computer science.
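As a rough illustration of what a contrastive objective asks of a model, here is a minimal numpy sketch of an InfoNCE-style loss on toy embedding vectors. The function and the toy embeddings are hypothetical stand-ins, not the exact loss of any published method:

```python
import numpy as np

def info_nce_loss(anchor, positive, negatives, temperature=0.1):
    """Contrastive (InfoNCE-style) loss for one anchor embedding:
    pull the positive (another view of the same image) close,
    push the negatives (views of other images) away."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    # similarity of the anchor to the positive, then to each negative
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / temperature
    # cross-entropy with the positive at index 0
    return -logits[0] + np.log(np.exp(logits).sum())

rng = np.random.default_rng(1)
anchor = rng.normal(size=32)
positive = anchor + 0.05 * rng.normal(size=32)      # a slightly perturbed view
negatives = [rng.normal(size=32) for _ in range(5)]  # unrelated embeddings
loss_good = info_nce_loss(anchor, positive, negatives)
loss_bad = info_nce_loss(anchor, negatives[0], negatives[1:])
# the loss is lower when the "positive" really is a matching view
```

Solving this puzzle forces the model to decide which image features make two views of the same sample recognizably alike, and those features are exactly what shapes its representation.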
More generally, I wonder how this disparity between the research priorities of mainstream computer science and those of computational biology affects the trajectory of methods development in our field. In computational biology, we often borrow and translate methods from mainstream computer science. When I look at the way deep learning has been applied in computational biology in the past decade, it becomes apparent to me that our reliance, or over-reliance, on mainstream computer science for inspiration and methods has deeply affected the problems we have been able to solve through deep learning. Across various biological subdisciplines, we have fantastic neural network methods for classification: we can classify microscopy images, protein sequences, and gene expression profiles. We also have terrific methods for segmentation: segmenting histology images, or cells in microscopy images, is easier than ever. These are problems that overlap substantially with the hotspots of deep learning research in mainstream computer vision: a lot of research is dedicated to solving classification datasets like ImageNet, and segmentation is critical to hot application areas like self-driving cars. However, there are also many problems more specific to, or of greater concern for, biological datasets that we still don’t have good answers to: multi-modal models, out-of-sample generalization, data efficiency, and the list goes on. For example, even state-of-the-art deep learning models struggle to perform well on data from new experiments performed on a different day than the experiments used to train the model. For a long time, generalization to out-of-sample data was not a priority research area for deep learning, and while that seems to be changing as new papers demonstrate how critical it is, the standard in the field is still to assume test data is drawn from the same distribution as the training data.
But a more serious consequence of computational biology relying so heavily on computer science for our understanding of deep learning is that biologists have been given the false impression that some of the limitations in applying deep learning to their data are fundamental, inherent limitations of deep learning, rather than a reflection of the research priorities in computer science. For example, many reviews of deep learning for biology have claimed that a major weakness of deep learning is that it is data hungry and cannot be used on smaller datasets. But is this really an inherent limitation of deep learning, or has there just been relatively little attention given to developing the kinds of data-efficient methods we need in biology? In work I presented at the Learning Meaningful Representations of Life workshop at NeurIPS last year, I showed that certain classes of methods, such as generative self-supervised methods, seem to be vastly more data efficient than the supervised methods favored by the field, learning effective representations of biology with just a tenth of the data. Similarly, these reviews have claimed that deep learning models are black boxes that lack interpretability. But again, this may just be because there has been little research connecting deep learning architectures to biological mechanisms: early work, such as this paper showing that neural networks can be constructed to be equivalent to biophysical models of cis-regulation, suggests that deep learning models can be designed to be inherently interpretable (as opposed to the post-hoc interpretability methods prevalent in computer science, like saliency maps).
In other words, deep learning has been marketed as not being applicable to certain biological datasets or purposes. But this is in part, a self-fulfilling prophecy: biologists are told not to explore deep learning for these types of data, or for these kinds of purposes, meaning that methods specific to biological data are relatively underexplored.
Overall, from my perspective, we have yet to realize the full potential of deep learning in biology: there are still so many problems potentially addressable by deep learning for which we lack appropriate methods. At least in imaging, while we have gotten very good at using deep learning to automate routine tasks, like labeling images, and preprocessing operations, like segmentation or noise correction, we still lack a good understanding of how we can apply deep learning to discover novel biological insights. Arguably, the latter is much more interesting than the former. To push out into these frontiers, biology needs to set its own agenda for deep learning: articulating the outcomes and properties of methods that biology specifically needs, and identifying where these needs are not currently served by transferring methods from computer science.
Often, these distinctions can be subtle. For example, I’ve noticed that when referring to the need for “interpretable” models, biologists can have slightly different motivations compared to industrial applications. When developing machine learning models that guide self-driving cars, interpretation is used to confirm that the model is paying attention to sensible features: we want to make sure the AI is paying attention to the stop sign by the road, not stray pixels at the top-right of an image. But in biology, we are also interested in using interpretation to discover. If we have a model that performs well, and we can rule out that the model depends on batch effects and other technical issues for this performance, then we are next interested in understanding what biological features the model is using. It could be that the reason why the model outperforms classic predictors is because it has found a certain motif in a protein sequence, or a certain pattern of protein puncta in a microscopy image, that was beyond our current body of knowledge and was not accounted for in classic models. When we describe a need for “interpretable” models in biology, we are not just interested in confirming that the model is reliable, but also understanding how the model perceives and organizes biology. The methods we develop need to account for this additional demand.
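As a small illustration of the kind of interrogation I mean, here is a toy sketch of post-hoc saliency computed by finite differences. The scoring function here is entirely made up, standing in for a trained network:

```python
import numpy as np

def saliency(model, x, eps=1e-4):
    """Post-hoc saliency via finite differences: how much does the
    model's score change when each input feature is nudged?"""
    base = model(x)
    grads = np.zeros_like(x, dtype=float)
    for i in range(x.size):
        xp = x.astype(float)      # astype returns a fresh copy
        xp[i] += eps
        grads[i] = (model(xp) - base) / eps
    return np.abs(grads)

# hypothetical "model" that scores an input using only features 0 and 2
def toy_model(x):
    return 3.0 * x[0] + 2.0 * x[2] ** 2

x = np.array([1.0, 1.0, 1.0, 1.0])
s = saliency(toy_model, x)
# features the model ignores (1 and 3) get near-zero saliency
```

For a real model, the features with high saliency are exactly where we would start looking for the motif or phenotype the model has latched onto, which is the discovery-oriented use of interpretation I describe above.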
In general, performing these conceptual exercises will help us better understand what our methodological needs are, and what kinds of questions we are interested in asking with deep learning. Importantly, we will identify areas that we are interested in that are not research priorities in mainstream computer science. For these areas, we cannot afford to wait for methods to emerge from the mainstream. Rather, if we are to truly address these interesting questions of discovery, and beyond, that do not have strong analogues in mainstream computer science, we will need to develop the methods ourselves.