Failure to reimplement: on the future-proofing of research papers

Thu 25 October 2018

Recently I've been reading papers from NLP, triaging the ideas that offer the best return (novelty, applicability to my data) for the time investment, and testing them out on ATAC-seq data. For those without a ready-made implementation, I'm left to roll my own. That is a much more involved process than git clone, and after the experience I'll describe below, I'm not sure it's worth the bother.

There are too many ways in which a re-implementation can deviate from the published method, especially when errors can be confounded between the data set and the model. My latest attempt was to re-implement dynamic k-max pooling. The idea is a good one: instead of keeping just the maximum activation in a sequence $p$, keep the $k$ maximum values $p_{max}^k$ in their original order. This is invariant to the distances between the values, and could be used to detect multiple common subsequence patterns in a longer sequence (think motifs within accessible chromatin). The original paper by Kalchbrenner, Grefenstette & Blunsom describes the method quite clearly, and a follow-up paper by Denil et al further simplifies the model. Both demonstrate its ability to summarize natural language sentences by evaluating its performance on a task of classifying the sentiment of tweets. The data set comes from a 2009 paper, whose authors scraped and curated 1.6m tweets for multi-class sentiment analysis. Fair enough, I thought: I can get the data, process it just as they have, code up dynamic k-max pooling, and then see if my results approach theirs.
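To make the pooling operation concrete, here is a minimal sketch of k-max pooling in PyTorch. The function names and tensor layout are my own; the layer-dependent choice of $k$ follows my reading of the formula in Kalchbrenner et al. 2014, $k_l = \max(k_{top}, \lceil \frac{L-l}{L} s \rceil)$.

```python
import math
import torch

def dynamic_k(k_top, layer, total_layers, seq_len):
    # Layer-dependent k, per my reading of Kalchbrenner et al. 2014:
    # k_l = max(k_top, ceil((L - l) / L * s))
    return max(k_top, math.ceil((total_layers - layer) / total_layers * seq_len))

def k_max_pooling(x, k, dim=-1):
    # Keep the k largest activations along `dim`, preserving their original order.
    top_idx = x.topk(k, dim=dim).indices   # positions of the k largest values
    index = top_idx.sort(dim=dim).values   # re-sort indices so the original order is kept
    return x.gather(dim, index)

# e.g. the 3 largest values of [1, 5, 2, 4, 3], in order of appearance:
x = torch.tensor([[1., 5., 2., 4., 3.]])
print(k_max_pooling(x, k=3))  # tensor([[5., 4., 3.]])
```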

Two weeks later, I do not have a good model for classifying the sentiment of tweets. But worse yet, I don't know if my model is wrong (which I care about), or my pre-processing of the Twitter sentiment data set is preventing my model from succeeding (which I don't care about).

I am far from the first person to try and analyze this data set. Multiple papers and blog posts describe how they prepare it, but few of them agree on how it should be done, and only one describes what it does in sufficient detail to be reproducible. Below is a table describing how several attempts, including my own, set about pre-processing the data to arrive at a vocabulary; a sketch of the Go et al. normalization steps follows the table:

| Method | Pre-processing steps | Vocabulary size |
| --- | --- | --- |
| Go et al. 2009 | Usernames to common token, URLs to common token, no character runs longer than two (e.g. huuuuuge -> huuge) | 364464 |
| Kalchbrenner et al. 2014 | Go et al. + lowercase all tokens | 76643 |
| Denil et al. 2014 | Presumably Kalchbrenner et al. 2014 | None provided, presumably 76643 |
| Kim 2017 | Too many steps to list, but the pipeline seems sensible, and is available (with code) here | 264936 |
| Zamparo 2018 | Too many steps to list, but code available here | 277990 |
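For reference, here is a rough sketch of the Go et al. 2009 normalization steps as I understand them; the regexes and the placeholder tokens (`<USER>`, `<URL>`) are my own choices, not taken from their code.

```python
import re

def normalize_tweet(text):
    # Usernames to a common token (the placeholder token is my own choice)
    text = re.sub(r"@\w+", "<USER>", text)
    # URLs to a common token
    text = re.sub(r"(?:https?://|www\.)\S+", "<URL>", text)
    # No character runs longer than two: huuuuuge -> huuge
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)
    return text

# Kalchbrenner et al. 2014 reportedly add lowercasing on top of these steps
def normalize_tweet_lowercase(text):
    return normalize_tweet(text).lower()

print(normalize_tweet("@sam that burger was huuuuuge! http://example.com"))
# <USER> that burger was huuge! <URL>
```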

The variation in vocabulary size is striking. The number of parameters in each of the models that consume the preprocessed data is dominated by the word embeddings: typically a 60-dimensional vector per word in the vocabulary (for Denil et al. and Kalchbrenner et al.).
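To make that concrete, here is the back-of-the-envelope arithmetic for a 60-dimensional embedding table at the vocabulary sizes in the table above:

```python
EMBEDDING_DIM = 60  # per-word vector size reported by Kalchbrenner et al. / Denil et al.

vocab_sizes = {
    "Go et al. 2009": 364_464,
    "Kalchbrenner et al. 2014": 76_643,
    "Zamparo 2018": 277_990,
}

for name, size in vocab_sizes.items():
    print(f"{name}: {size * EMBEDDING_DIM / 1e6:.1f}M embedding parameters")
# Go et al. 2009: 21.9M
# Kalchbrenner et al. 2014: 4.6M
# Zamparo 2018: 16.7M
```

So the choice of pre-processing alone swings the parameter count of an otherwise identical model by more than 17 million.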

I feel confident that my DCNN model should produce output much like Denil et al. 2014, but it does not achieve good performance on the test set. Is the problem my model specification? My choice of hyperparameters? A data pre-processing error? I'll never know. There are too many places where the descriptions of the model, data, or code are insufficiently precise for me to re-create the model and compare my results against those reported in the papers.

I'm far from the first to say that we as scientists need to be more careful about communicating our findings in ways that are easily understood and replicated. But let me add my voice to the chorus here. For more concrete principles and practices to follow, I suggest checking out Gaël Varoquaux's blog on the subject. Written in 2015, it has aged quite well.