Paper Review - Basset
In which I record my thoughts on Basset.
Bio Background #
The genome consists of (broadly) two types of genes, coding genes and noncoding genes. Coding genes get translated into proteins (as laid down by the Central Dogma) and are what most of us learned about in bio class. Non-coding genes… don’t. As I understand it, noncoding genes can do a bunch of different things, but part of their function is to regulate coding gene activity through a number of mechanisms.
Related to this, in the genomics world, there’s a concept of “DNA accessibility”. In eukaryotic cells in particular, DNA is packaged together with proteins into a material (or structure, not sure what the right term is here) called chromatin. Chromatin can take on different shapes, which affect which genes the transcription machinery can reach (and which can therefore ultimately become proteins). If I’m understanding correctly, certain noncoding genes can influence chromatin shape and thereby transcription.
What is this paper about? #
This paper focuses on predicting DNA sequence accessibility in different cell types, with DNA sequences as input and DNase-seq output as training data. The authors use their trained model to re-discover sequences that are already known to correlate with cell-specific accessibility, and to do what they call “in-silico saturation mutagenesis” - briefly, testing the impact of every possible single-nucleotide mutation on predicted accessibility.
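To make sure I understood what “in-silico saturation mutagenesis” actually means, here’s a minimal sketch of the idea in Python. The `predict` function is a hypothetical stand-in for a trained Basset-like model scoring one cell type; this is my own illustration, not the authors’ code.

```python
import numpy as np

BASES = "ACGT"

def saturation_mutagenesis(seq, predict):
    """Substitute every alternative base at every position and record the
    change in predicted accessibility relative to the original sequence.

    `predict` (hypothetical) maps a DNA string to a single accessibility
    score; the real model outputs scores for 164 cell types at once.
    """
    baseline = predict(seq)
    effects = np.zeros((len(seq), len(BASES)))
    for i, ref in enumerate(seq):
        for j, alt in enumerate(BASES):
            if alt == ref:
                continue  # not a mutation; effect stays 0
            mutated = seq[:i] + alt + seq[i + 1:]
            effects[i, j] = predict(mutated) - baseline
    return effects  # shape: (sequence length, 4)
```

Positions where substitutions swing the prediction a lot are the ones the model treats as important.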
Technical Methods #
The Basset model is a classifier that takes a one-hot encoded DNA sequence as input at test time and outputs accessibility classifications (accessible or not) for 164 different cell types. At training time, each input sequence is paired with a binary vector of accessibility labels for those 164 cell types.
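For concreteness, here’s roughly what I picture the one-hot encoding looking like (my own sketch, not the authors’ code; I’m not sure how the paper handles ambiguous bases like N):

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a 4 x L binary matrix (rows = A, C, G, T)."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((4, len(seq)), dtype=np.float32)
    for i, base in enumerate(seq):
        if base in index:            # unknown bases (e.g. N) stay all-zero here
            encoded[index[base], i] = 1.0
    return encoded

x = one_hot("ACGTN")                 # 4 x 5 matrix; the "N" column is all zeros
y = np.zeros(164, dtype=np.float32)  # training target: one accessibility label per cell type
```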
Basset is a 6-layer convolutional neural network. Three of the layers are convolutional and use batch norm, ReLU rectification, and max pooling. Two subsequent fully connected layers use ReLU neurons with dropout. A final sigmoid layer has 164 neurons, representing accessibility for the 164 different cell types.
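To get the shape of that into my head, I sketched it in PyTorch. The layer sizes below are placeholders rather than the paper’s exact hyperparameters (and the authors didn’t use PyTorch), so treat this as a cartoon of “three conv blocks, two fully connected layers, sigmoid output”:

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, width, pool):
    # conv -> batch norm -> ReLU -> max pool, as described above
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=width),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
        nn.MaxPool1d(pool),
    )

basset_like = nn.Sequential(
    conv_block(4, 300, 19, 3),           # input has 4 channels, one per base
    conv_block(300, 200, 11, 4),
    conv_block(200, 200, 7, 4),
    nn.Flatten(),
    nn.LazyLinear(1000), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(1000, 1000), nn.ReLU(), nn.Dropout(0.3),
    nn.Linear(1000, 164), nn.Sigmoid(),  # one accessibility probability per cell type
)
```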
The authors used Bayesian optimization for hyperparameter tuning during training.
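I hadn’t seen Bayesian optimization used for hyperparameter search before, so here’s a toy sketch of the idea using scikit-optimize (not the tool the authors used, and `train_and_validate` is a hypothetical function that trains the net with the given hyperparameters and returns validation loss):

```python
from skopt import gp_minimize
from skopt.space import Real, Integer

# A hypothetical search space over a few hyperparameters.
space = [
    Real(1e-4, 1e-1, prior="log-uniform", name="learning_rate"),
    Real(0.0, 0.7, name="dropout"),
    Integer(100, 400, name="n_filters"),
]

def objective(params):
    lr, dropout, n_filters = params
    # train_and_validate is a stand-in for actually training the network
    # with these hyperparameters and returning the validation loss.
    return train_and_validate(lr=lr, dropout=dropout, n_filters=n_filters)

# The Gaussian-process surrogate proposes promising hyperparameter settings
# to try next instead of sampling them blindly.
result = gp_minimize(objective, space, n_calls=30, random_state=0)
print(result.x)  # best hyperparameters found
```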
They did something fairly interesting on the interpretability front, which I’m still trying to wrap my head around. (I’m not sure if this is common.) Based on the assumption that each convolutional filter captures knowledge about accessibility-related sequence motifs, they nullified filters one by one and observed the change in outputs (by comparing the sum of squares of the net’s output before and after). Clustering filters by their impact on accessibility predictions then revealed a set of filters that matched known factors involved in epithelial development.
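My mental model of that procedure, sketched in PyTorch (assuming “nullify” means zeroing a filter’s weights and that the comparison is a sum of squared differences per cell type; the paper’s exact recipe may differ):

```python
import torch

def filter_influence(model, conv_layer, x):
    """Zero out each filter of `conv_layer` (a nn.Conv1d inside `model`) in
    turn and score it by how much the predictions move, per cell type."""
    model.eval()
    scores = []
    with torch.no_grad():
        baseline = model(x)                        # (batch, 164) predictions
        original = conv_layer.weight.detach().clone()
        for f in range(conv_layer.out_channels):
            conv_layer.weight[f].zero_()           # "nullify" filter f
            delta = ((model(x) - baseline) ** 2).sum(dim=0)
            scores.append(delta.numpy())           # 164 numbers for this filter
            conv_layer.weight[f] = original[f]     # restore before the next one
    return scores  # one vector per filter, ready to cluster
```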
They also did something else cool: they converted filters to position weight matrices (PWMs) and then compared them to known transcription factor motifs using a tool called TomTom. This allowed them to see whether filters actually captured “real features” of the data. Combined with the previous method, this also enables them and others to look for potentially rare but important sequences that impact accessibility.
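My rough understanding of the filter-to-PWM step, again as a sketch (I’m assuming the recipe is “find subsequences that strongly activate the filter, then count bases at each position”; the paper’s exact procedure may differ):

```python
import numpy as np

BASE_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def filter_to_pwm(activations, sequences, filter_width, threshold):
    """Build a position weight matrix for one filter.

    `activations`: per-position activation of that filter for each sequence,
    `sequences`: the matching DNA strings. Wherever the activation exceeds
    `threshold`, grab the underlying window and count its bases.
    """
    counts = np.full((4, filter_width), 1e-3)       # small pseudocount
    for acts, seq in zip(activations, sequences):
        for pos in np.where(np.asarray(acts) > threshold)[0]:
            window = seq[pos:pos + filter_width]
            if len(window) < filter_width:
                continue
            for i, base in enumerate(window):
                if base in BASE_INDEX:
                    counts[BASE_INDEX[base], i] += 1
    return counts / counts.sum(axis=0)              # columns sum to 1
```

The resulting 4 x width matrix is what gets compared against known transcription factor motifs with TomTom.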
Why is this important? #
Through reading this paper, I think I got a better sense of what aspects of these models might actually be useful than I had when I read about DeepBind.
I was told this by my professor, but didn’t fully grok it until now - biologists care much less about prediction than ML people do and much more about interpretability. We see that in this paper - most of the predictions the model makes have already been validated in experimental studies. The authors discuss a few leads to follow, but I suspect it’s hard to convince other biologists to do the work to follow them.
On the other hand, the way the authors interrogated the model and found meaningful information inside of its filters is pretty exciting to me. It seems like comp. bio / ML people have an opportunity to explore interpreting neural nets in ways that make less sense in the world of image processing and NLP.
Related to this, I’m excited by and think this line of research is valuable for two primary reasons. First, to the extent it moves us towards a world where ML models can actually be used to guide/substitute for expensive experimental work to discover noncoding sequence/accessibility relationships, it promises to make genomics research more efficient. Second, I have this vision of ML being used to build better knowledge maps of biology that enable biologists to answer questions by actually probing models and using them almost like expert systems. The interpretability stuff in this paper seems like a step in that direction.
Meta #
So far, I’m frustrated by how bad the definitions and descriptions of biological methods / processes are in these papers. Maybe I don’t realize the same thing is true in CS, but I repeatedly find myself wanting to answer a simple question like “what is DNase-seq?” and, instead of getting a simple description, finding detailed discussions of the processes. As a Computer Scientist, I wish biologists described things in terms of abstractions more and mechanisms less. For example, why can’t any of these sources just tell me what the output number from DNase-seq (or any other process) means? I lost hours just trying to figure out what the output values of Basset’s NN represented and trying to understand what geneticists mean by “enrichment”.
In case anyone else has this problem, one tip: I’ve (ironically) found that file formats and model inputs are sometimes easier to decipher than the prose descriptions.
Questions #
- Are DHSs DNA sequences or something else? (Tentative answer) Yes; I think at least in the context of this paper, we care about sequence-specific DNase enzymes.
- Why is accessibility binary? Is it? No, it’s not. The NN seems to be outputting a continuous measure of accessibility. That said, I don’t know what the unit of chromatin accessibility actually is…