Paper Review - DeepSEA

In which I record my thoughts on DeepSEA.

Terminology

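ChIP-seq (Chromatin Immunoprecipitation Sequencing): an assay that combines chromatin immunoprecipitation with sequencing to map where a given protein (e.g. a transcription factor) binds DNA across the genome.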
Expression Quantitative Trait Loci (eQTLs)
Cofactor Binding Sequences
Histone Marks: modifications to histone proteins in the nucleosome that impact the shape of chromatin.
gkm-SVM (gapped k-mer SVM): the previous SOTA model for predicting transcription factor binding from ChIP-seq data.
Allele: one of the variant forms of a gene (or other DNA sequence) found at a given position on a chromosome.

What is this paper about?

This paper trains a CNN to predict the presence of 919 “chromatin features” (different TF binding sites, DHSs, and histone marks) from 1000bp DNA sequences. It then tests the model by deriving a functional significance score from its output and using that score to train logistic regression classifiers, which predict whether a SNP appears in one of several catalogs of SNPs known to affect biological function, e.g. a GWAS catalog of disease-related SNPs.
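
To make the input side concrete, here is a minimal sketch of how a DNA window could be turned into the kind of matrix a CNN consumes, and how a reference/alternate pair for a single SNP could be built. This is my own illustration, not code from the paper: the function names, the choice of the centre position for the variant, and the toy 9bp window (standing in for 1000bp) are all assumptions.

```python
# Illustrative sketch only; names and the toy window are assumptions,
# not DeepSEA's actual preprocessing code.
BASES = "ACGT"


def one_hot(seq):
    """Encode a DNA string as one 4-element row per base (order A, C, G, T).

    Bases outside ACGT (e.g. N) become an all-zero row.
    """
    return [[1.0 if base == b else 0.0 for b in BASES] for base in seq.upper()]


def make_ref_alt(window, alt_base):
    """Return (ref, alt) sequences that differ only at the centre position."""
    centre = len(window) // 2
    alt = window[:centre] + alt_base + window[centre + 1:]
    return window, alt


window = "ACGTACGTA"  # stands in for a 1000bp window around the SNP
ref, alt = make_ref_alt(window, "G")
x_ref, x_alt = one_hot(ref), one_hot(alt)

# Each encoded sequence has one row per base and four columns.
print(len(x_ref), len(x_ref[0]))
```

A model with 919 outputs would then be run once on `x_ref` and once on `x_alt`, so that the two sets of predictions can be compared for the same SNP.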

Technical Methods

Why is this important?

Meta

This is the clearest of the three papers I’ve read so far, but that may be because I’m slowly internalizing the language of the domain.

Questions

What does this mean? I guess this paper precedes DeepBind so maybe this was just the first of its kind…

“Such approaches are valuable for prioritizing sequence variants; however, current methods except for a parallel work have not been able to extract and utilize regulatory sequence information de novo for noncoding-variant function prediction, which requires precise allele-specific prediction with single-nucleotide sensitivity. In fact, no previous approach predicts functional effects of noncoding variants from only genomic sequence, and no method has been demonstrated to predict with single-nucleotide sensitivity the effects of noncoding variants on transcription factor (TF) binding, DNA accessibility and histone marks of sequences.”
What’s the difference between “motifs” and “evolutionary features and chromatin annotations”?
When they trained the logistic regression classifiers, what aspects of the original model were they using? I think I understand: you can use the output of the NN model (the predicted impact on chromatin features) as the input to the classifier and train the classifier to predict from it.
What do the probability predictions in the output represent? Are they continuous or purely for classification? Is the impact on a chromatin feature binary? Yes, the outputs are probabilities, but the thing we actually care about is a binary classification (whether the factor binds, or whether the histone mark is present).
Is the histone mark on the nucleosome being predicted by the DNA sequence or is the DNA sequence content causing the histone mark? It’s believed to be the latter although there’s apparently some work claiming that the former can also happen.
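
To make the answers about the classifier and the probability outputs concrete, here is a minimal sketch of how per-feature probabilities for the reference and alternate alleles could be turned into classifier inputs and then into a binary call. It is a sketch under stated assumptions, not the paper's implementation: the specific features (absolute probability difference and change in log odds), the hand-picked weights, and the 0.5 threshold are all my own choices for illustration, and three toy features stand in for the 919.

```python
import math


def log_odds(p, eps=1e-6):
    """Log odds of a probability, clipped away from 0 and 1 for stability."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def variant_features(p_ref, p_alt):
    """Per chromatin feature: absolute probability difference, then log-odds change.

    Assumption: this feature set is illustrative; the paper's may differ.
    """
    abs_diff = [abs(a - r) for r, a in zip(p_ref, p_alt)]
    log_change = [log_odds(a) - log_odds(r) for r, a in zip(p_ref, p_alt)]
    return abs_diff + log_change


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def classify(features, weights, bias, threshold=0.5):
    """Logistic regression: return (probability, binary call)."""
    prob = sigmoid(sum(w * f for w, f in zip(weights, features)) + bias)
    return prob, prob >= threshold


# Toy predicted probabilities for 3 chromatin features (not real model output).
p_ref = [0.90, 0.20, 0.50]
p_alt = [0.30, 0.20, 0.55]

feats = variant_features(p_ref, p_alt)
weights = [2.0, 2.0, 2.0, 0.5, 0.5, 0.5]  # hand-picked, not trained
prob, is_functional = classify(feats, weights, bias=-1.0)
print(len(feats), round(prob, 3), is_functional)
```

The point of the sketch is the shape of the pipeline: the network's outputs stay continuous probabilities, the classifier sees how much those probabilities move between alleles, and only the final step is reduced to a yes/no decision.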