Paper Review - DeepBind

In which I record my thoughts on DeepBind.

What is this paper about? #

The authors of this paper designed a conv net model to predict how well different proteins will bind to sequences of DNA or RNA. They train one model per protein on many different sequences and show that these models predict binding affinities well enough to produce insights into the impact of single nucleotide mutations. They then discuss the datasets and tasks on which they tested their model and how it performed (often SOTA [as of 2015]).

Why is this important? #

At the object level, being able to better predict protein binding affinity to DNA/RNA sequences can help us better understand gene expression and identify mutations that will degrade functioning. At the meta level, this paper showcases (or at least claims to showcase) the power of convolutional neural nets to “automatically” learn general models of protein binding from heterogeneous inputs. The authors highlight how using a CNN obviates the need for lots of manual hand-tuning, although I suspect that, as in many deep learning papers, they’re failing to mention other tuning they did to get this model to work as well as it does. The authors point out that, if these sorts of NN models prove useful in bio, future research can essentially ride the curve of deep learning model improvement that they expect to result from increased investment by industry and academia in deep learning.

Technical Methods #

Their conv net model has 4 layers: a convolutional layer, a rectification layer, a max/avg pooling layer, and one or two neuron layers, depending on which performs better on the validation set.
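As a rough sketch of the pipeline described above (convolution, rectification, pooling, then a neuron layer), here is a forward pass in plain Python. This is not the authors' implementation: the function names, the plain `max(0, x)` rectification, max-only pooling, and the single linear output layer are all assumptions made to keep the example small.

```python
def conv1d(seq_matrix, detectors):
    """Slide each motif detector along the sequence.

    seq_matrix: list of rows, one per position, each a list of 4 floats.
    detectors:  list of detectors, each a list of rows of 4 weights.
    Returns one list of per-position scores for each detector.
    """
    feature_maps = []
    for det in detectors:
        width = len(det)
        scores = []
        for start in range(len(seq_matrix) - width + 1):
            total = 0.0
            for offset in range(width):
                for channel in range(4):
                    total += seq_matrix[start + offset][channel] * det[offset][channel]
            scores.append(total)
        feature_maps.append(scores)
    return feature_maps


def rectify(feature_maps):
    """Clamp negative responses to zero (plain ReLU, an assumption here)."""
    return [[max(0.0, s) for s in scores] for scores in feature_maps]


def max_pool(feature_maps):
    """Keep the strongest response of each detector over the whole sequence."""
    return [max(scores) for scores in feature_maps]


def linear(features, weights, bias):
    """Single output neuron: weighted sum of pooled features plus a bias."""
    return sum(f * w for f, w in zip(features, weights)) + bias


def forward(seq_matrix, detectors, weights, bias):
    """conv -> rectify -> max pool -> linear, returning one real-valued score."""
    pooled = max_pool(rectify(conv1d(seq_matrix, detectors)))
    return linear(pooled, weights, bias)
```

Pooling over the whole sequence means the score depends on whether a motif-like pattern appears anywhere, not on where it appears.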

Each model is trained to determine the binding affinity (or binding confidence in the classification case) for one protein. Their models take a DNA or RNA sequence as input and output a real-valued binding affinity score that represents the likelihood of that protein binding to this sequence (note: I still don’t understand what exactly this score represents). Specifically, they convert a sequence of $m$ characters chosen from {A, C, T, G} (in the case of DNA) into a padded $m \times 4$ matrix in which each row’s values sum to 1 and a $1$ at position $(j, k)$ indicates that nucleotide $k$ appears at position $j$.
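To make the encoding concrete, here is a minimal sketch of one-hot encoding a DNA string. The column order `ACGT`, the `pad` parameter, and the choice of uniform `0.25` rows for padding (one way to keep every row summing to 1) are assumptions for illustration, not details taken from this review.

```python
ALPHABET = "ACGT"  # column order is an assumption


def one_hot(sequence, pad=0):
    """Encode a DNA string as a list of 4-element rows.

    Each nucleotide becomes a row with a single 1.0 in its column.
    `pad` uniform rows (0.25 each) are added on both ends, so every
    row in the result still sums to 1.
    Raises ValueError for characters outside ALPHABET.
    """
    rows = [[0.25] * 4 for _ in range(pad)]
    for base in sequence.upper():
        row = [0.0] * 4
        row[ALPHABET.index(base)] = 1.0
        rows.append(row)
    rows.extend([0.25] * 4 for _ in range(pad))
    return rows
```

For example, `one_hot("AC", pad=1)` gives four rows: a uniform padding row, the row for A, the row for C, and another uniform padding row.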

During training, they use dropout on both NN layers (when applicable) and use random search to find good hyperparameters. Since their models can function as either classifiers or predictors (regressors), they use a different loss function for each: negative log likelihood for classification and MSE for prediction.
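The two losses mentioned above can be sketched as follows. This assumes binary labels and that the raw model score is passed through a sigmoid to get a probability in the classification case; both are assumptions for illustration, and the function names are made up.

```python
import math


def nll_loss(scores, labels):
    """Mean negative log likelihood for binary labels (0 or 1).

    Equivalent to -[y*log(p) + (1-y)*log(1-p)] with p = sigmoid(score),
    written in a form that avoids overflow for large |score|.
    """
    total = 0.0
    for s, y in zip(scores, labels):
        total += max(s, 0.0) - s * y + math.log1p(math.exp(-abs(s)))
    return total / len(scores)


def mse_loss(predictions, targets):
    """Mean squared error between predicted and measured affinities."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared) / len(squared)
```

A score of `0.0` corresponds to a probability of 0.5, so `nll_loss([0.0], [1])` equals `log(2)`, which is a handy sanity check.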

Criticisms #

To Hide Or Not To Hide #

I’m not sure how common this is, but the decision to selectively use the hidden ReLU layer depending on validation performance felt unmotivated to me. I’d be interested in learning more about the intuition behind why removing it sometimes improves performance.

Never Roll Your Own… Models #

Their model is “built from the ground up in C++ and Python, with only low-level dependencies (CUDA and Numpy).” In fairness, this paper was written in 2015, so the TensorFlow paper had either just dropped or not even appeared yet. Unlike some reviewers, I don’t expect the authors of papers to time travel, so maybe it just shows how fast things progress in ML these days that I now expect a paper like this to use off-the-shelf components.

Regardless, someone has since created a Keras version of DeepBind.

Speculative Future Extensions #

Questions #