A Unified Architecture for Natural Language Processing

Fri 24 August 2018

A recent focus of the research in our lab has been to re-imagine how techniques in natural language processing can be transferred to biological sequences. As a result, I've been reading a lot more NLP papers, and in almost every one I see this paper by Collobert & Weston cited as an inspiration. This post is a Yobibyte-style overview of that paper, A Unified Architecture for Natural Language Processing by Ronan Collobert and Jason Weston. The paper, which recently won the Test of Time award at ICML 2018, was one of the first to show that unsupervised learning of word-based features can act as a powerful regularizer for downstream NLP tasks, and provide a performance boost as well. It motivated a number of later lines of work tying together multi-task learning, embeddings, and feature learning. It's a good place to start if, like me, you have some familiarity with neural networks but are new to the NLP literature.

What and Why

As the paper's introduction explains, at the time of writing, models for solving NLP tasks (part-of-speech tagging, chunking, parsing, word-sense disambiguation, named entity recognition, ...) typically followed a recipe of ingesting sentences, representing them as features, and labeling words or sets of words. Each task was addressed separately: a vast set of hand-engineered features was built, then fed into a linear classifier (e.g. an SVM with a linear kernel). This is undesirable for a number of reasons, chief among them that the features are task-specific and are discovered or validated by trial and error. The authors' insight was to instead define one model that learned low-level language features in an unsupervised manner, which were shared among sub-networks that could flexibly refine and combine them for each task. Each task-specific sub-network could then build upon these shared features and further refine them into task-specific predictors using task-specific labeled data. Collobert & Weston were not the first to propose unsupervised feature learning for NLP, or multi-task learning, and there is more historical context in the paper. But they did it very well, and wrote about it very clearly. Apart from their results (which were considered strong at the time), the clear exposition of their ideas is what holds the enduring value of this paper.

In many real-world applications, a lack of labeled data imposes practical limitations on the class of models you can use, and therefore on the performance and generalizability of your solution. Since many tasks in NLP share information about word composition, ordering, or structure, using a language model to represent words and sentences as features in an unsupervised way should let you make the most of your task-specific labeled data by refining those features in a task-specific way. That's the hypothesis the authors set out to test.

The six tasks they considered were:

  • Part-of-speech tagging (POS) labels each word in a sentence with a unique tag identifying its syntactic role in the sentence
  • Chunking labels segments of the sentence as belonging to a particular class of phrase, such as noun phrase or verb phrase (NP, VP). Words within a phrase may be labeled positionally, as the beginning of a chunk or the inside of a chunk.
  • Named Entity Recognition (NER) labels atomic elements of the sentence with word categories such as person, company, or location.
  • Semantic Role Labeling (SRL) labels syntactic constituents of sentences with a semantic role. This seems like it can take multiple forms, but in the one described by the authors, they use a formalism from PropBank, wherein words are arguments of a predicate in the sentence. For example, the sentence "John ate the apple" might be labeled as "[John]ARG0 [ate]REL [the apple]ARG1" to indicate that 'ate' is the predicate, and both 'John' and 'the apple' are arguments of this predicate. In more complex sentences with multiple verbs, words may have multiple labels.
  • Language model A language model estimates the probability of the next word of a sentence, given some context of observed words. The authors don't pose the problem in exactly this form, but rather transform the prediction over words into a classification of true text versus artificially generated text (like the negative sampling approach later taken by word2vec); a sketch of this corruption idea follows the list.
  • Semantically Related Words This task predicts whether words in the sentence are synonyms, holonyms, or hypernyms. They used WordNet for the labels.
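
Since the language-model trick above is the part that most resembles later negative-sampling work, here is a minimal sketch of how the "artificially generated" examples might be constructed: take a window of real text and replace its middle word with a random dictionary word. The window size and sampling scheme here are my illustrative choices, not details taken from the paper.

```python
import random

def corrupt_middle_word(window, vocab):
    """Return a 'fake' window: the real window with its middle word
    swapped for a random vocabulary word (illustrative negative sampling)."""
    fake = list(window)
    fake[len(fake) // 2] = random.choice(vocab)
    return fake

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
real = ["the", "cat", "sat", "on", "mat"]      # window from real text -> positive example
fake = corrupt_middle_word(real, vocab)        # corrupted window      -> negative example
```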

The authors consider SRL to be the most difficult of the tasks. In their experiments, they use the other tasks both to demonstrate that their architecture works well for each task, and to measure the level of improvement on SRL achieved by solving the various tasks together.

How

Their network architecture is summarized in Figure 1 of the paper. I won't reproduce it here, since it's technically covered by ACM copyright, and I don't want to bother asking permission to reproduce it. The inputs to the network are sentences, and the outputs are task-specific class labels, either for each word, for parts of the sentence, or for the entire sentence.

Embedding words

Each word is indexed into a dictionary of words $D \subset \mathbb{N}$ that associates a (learned) vector in some $d$-dimensional space via a lookup table $$ LT_{W}(i) = W_i $$ where $W \in \mathbb{R}^{d \times |D|}$ is a matrix of word vectors and $i$ indexes each word. So a given input sentence $\{s_1, s_2, s_3, \dots \}$ is transformed into a sequence of vectors $\{ W_{s_1}, W_{s_2}, W_{s_3}, \dots \}$. In practice the authors decomposed each word $\mathbf{i}$ into $K$ components $(i_1, i_2, \dots, i_K)$ and used an embedding layer for each, so that each word $\mathbf{i}$ was embedded into a $d = \sum_{k=1}^{K} d^k$-dimensional space by concatenating the vectors of all components: $$LT_{W_1,W_2,\dots,W_K}(\mathbf{i}) = \left( LT_{W_1}(i_1), LT_{W_2}(i_2), \dots, LT_{W_K}(i_K) \right)$$ It isn't entirely clear how this word-level decomposition was defined in practice, but they suggest it was done for the SRL task, with additional lookup tables for each word depending on its distance to the predicate.
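
To make the lookup-table layer concrete, here is a minimal sketch in PyTorch (my choice of framework, not the authors'). The number of components, vocabulary sizes, and dimensions below are illustrative values, not the ones used in the paper.

```python
import torch
import torch.nn as nn

class WordLookup(nn.Module):
    """Embed each word by concatenating the embeddings of its K components."""
    def __init__(self, vocab_sizes=(30000, 5), dims=(50, 5)):
        super().__init__()
        # one lookup table W_k per component, e.g. the word index itself and,
        # for SRL, a discretized distance to the predicate
        self.tables = nn.ModuleList(
            nn.Embedding(v, d) for v, d in zip(vocab_sizes, dims)
        )

    def forward(self, component_indices):
        # component_indices: K LongTensors, each of shape (sentence_len,)
        parts = [table(idx) for table, idx in zip(self.tables, component_indices)]
        # concatenate along the feature dimension -> (sentence_len, sum(dims))
        return torch.cat(parts, dim=-1)

# usage: a 4-word sentence, each word described by (word id, distance-to-predicate id)
lookup = WordLookup()
words = torch.tensor([12, 5004, 7, 291])
dist_to_predicate = torch.tensor([0, 1, 2, 3])
x = lookup([words, dist_to_predicate])   # shape (4, 55)
```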

Variable length sentences

While each word is embedded into the same tuple of vector spaces, the variation in sentence length causes problems for feed-forward neural networks. Their solution was to use Time-Delay Neural Networks (TDNNs), where time is interpreted as relative word position in the sequence. A TDNN layer convolves its input sequence $\mathbf{x}(\cdot)$ word by word with learned parameters $\mathbf{L}$, outputting another sequence $\mathbf{o}$ such that at time $t$: $$ \mathbf{o}(t) := \sum_{j=1-t}^{n-t}\mathbf{L}_j \cdot \mathbf{x}_{t+j} $$ where $\mathbf{L}_j \in \mathbb{R}^{n_{hu} \times d}$ are the parameters of the layer (with $n_{hu}$ hidden units), subject to a kernel width constraint $$ \forall \, |j| > (ksz - 1) / 2, \quad \mathbf{L}_j = \mathbf{0} $$

On top of this one-dimensional TDNN convolutional layer (possibly layers; they can be stacked), they add a max layer that takes the maximum activation over time, sentence-wide, for each of the $n_{hu}$ hidden units. This makes the dimensionality of the max layer's output depend only on the number of hidden units in the layer, rather than on the number of words in the sentence. Subsequent layers can thus be any neural network layer that consumes a fixed-size vectorial input.
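
Here is a minimal sketch of the TDNN-plus-max stage in PyTorch. The kernel width constraint above just says the convolution has a finite kernel of size $ksz$, so a standard 1-D convolution does the job; the values of $d$, $n_{hu}$, and $ksz$ below are illustrative, not the paper's.

```python
import torch
import torch.nn as nn

class TDNNMax(nn.Module):
    """1-D convolution over word positions followed by a max over time,
    yielding a fixed-size vector regardless of sentence length."""
    def __init__(self, d=55, n_hu=100, ksz=3):
        super().__init__()
        # padding keeps one output per word position
        self.conv = nn.Conv1d(d, n_hu, kernel_size=ksz, padding=(ksz - 1) // 2)

    def forward(self, x):
        # x: (sentence_len, d) word vectors from the lookup layer
        o = self.conv(x.t().unsqueeze(0))        # (1, n_hu, sentence_len)
        return o.max(dim=-1).values.squeeze(0)   # (n_hu,) max over time

tdnn = TDNNMax()
sentence = torch.randn(4, 55)     # e.g. the output of the lookup sketch above
features = tdnn(sentence)         # fixed-size (100,) vector for any sentence length
```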

Task-specific additional layers, multi-tasking, instance-level classification

The TDNN layer performs a linear operation over the input words. For POS and NER this was sufficient, but for SRL the authors needed to stack further layers with element-wise non-linear activations on top of the TDNN output. The authors point out that for semantically related tasks, the features learned for one task may provide a benefit for the others. It is not hard to convince yourself that features learned for part-of-speech tagging may be useful for named entity recognition or semantic role labeling. The way they chose to instantiate multi-task learning via parameter sharing was to share much of the lookup-table layer among all tasks. While some tasks had additional lookup tables, all tasks shared the majority of the embedding (lookup) layers, so each task benefited from the small refinements to the shared features made while training the other tasks. Training was done by considering each task in sequence as follows (a sketch of this loop follows the list):

  1. Select a task
  2. Select a random example for the task (ed: this was likely mini-batched)
  3. Update the NN parameters for this task by gradient descent, including any shared parameters
  4. Repeat from step 1

The final layer for each task was a softmax layer, with the exception of the language model task.
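
Here is a sketch of how I read that training loop: a shared trunk (standing in for the lookup and TDNN layers) with one linear head per task, and alternating stochastic updates. The tasks, layer sizes, and data below are dummies, and I've simplified away the per-word windows the paper actually classifies.

```python
import random
import torch
import torch.nn as nn

# shared trunk standing in for the lookup table + TDNN + max layers
shared = nn.Sequential(nn.Linear(55, 100), nn.Tanh())
heads = {                              # task-specific final layers (sizes illustrative)
    "srl": nn.Linear(100, 20),
    "pos": nn.Linear(100, 45),
}
params = list(shared.parameters()) + [p for h in heads.values() for p in h.parameters()]
opt = torch.optim.SGD(params, lr=0.01)
loss_fn = nn.CrossEntropyLoss()

def random_example(task):
    """Dummy data loader: a fixed-size feature vector and a class label."""
    n_classes = heads[task].out_features
    return torch.randn(1, 55), torch.randint(n_classes, (1,))

for step in range(1000):
    task = random.choice(list(heads))            # 1. select a task
    x, y = random_example(task)                  # 2. select a random example for it
    loss = loss_fn(heads[task](shared(x)), y)    # forward through shared trunk + task head
    opt.zero_grad()
    loss.backward()                              # 3. gradients flow into shared parameters too
    opt.step()
```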

Evaluation

The authors used Sections 02-21 of PropBank version 1 (approximately 1 million words) for training and Section 23 for testing in the SRL experiments. The POS and chunking tasks used the Penn TreeBank. NER labeled data was obtained by running the Stanford Named Entity Recognizer over the same data used for SRL. The authors discarded anything with non-ASCII characters and transformed accented characters into their non-accented equivalents. All tasks used a vocabulary of the 30,000 most common words from Wikipedia (which was used to train the language model task).
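
A small sketch of that kind of preprocessing: strip accents down to ASCII and keep only the 30,000 most frequent words. This is a reconstruction with standard-library tools, not the authors' actual pipeline, and the catch-all token for out-of-vocabulary words is my assumption.

```python
import unicodedata
from collections import Counter

def to_ascii(token):
    """Map accented characters to their non-accented equivalents, drop the rest."""
    return unicodedata.normalize("NFKD", token).encode("ascii", "ignore").decode("ascii")

def build_vocab(corpus_tokens, size=30000):
    counts = Counter(to_ascii(t) for t in corpus_tokens)
    counts.pop("", None)                        # tokens that were entirely non-ASCII
    vocab = {w: i for i, (w, _) in enumerate(counts.most_common(size))}
    vocab["UNKNOWN"] = len(vocab)               # catch-all for rare words (my assumption)
    return vocab

# usage with a toy corpus
tokens = "résumé naïve the the cat sat on the mat".split()
vocab = build_vocab(tokens, size=5)
```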

The authors report their results in the context of improvement on the SRL task, measured by per-word error rate. They report both the effect of increasing the dimensionality of the word embedding table and the effect on SRL test-set error of training that task in tandem with other related tasks. I won't reproduce Table 2 or Figure 3 here, but will offer a few observations:

  1. All multi-task trained models (that is, SRL plus one or more additional tasks) performed better than SRL alone.
  2. The language model offered the single best boost in performance on the SRL task.
  3. Keeping the training tasks constant, increasing the word embedding dimension size had either no effect or a mild negative effect on test word error rate performance.
  4. SRL plus the language model had the best performance, but adding additional tasks on top of this did not help in any significant way.

Comments

Having read this paper, I find that many elements present in other NLP-related work I have read in the past are now much clearer (e.g. approximate inference in language models à la word2vec). It's also instructive to think about how multi-task learning can be deployed, especially in terms of how the parameters are shared. Here the authors chose to have one shared embedding for words, but they could just as easily have had group-specific sets of shared features where it makes sense to do so. For example, the multi-task model (SRL + LM) was only narrowly better than (SRL + POS + Chunking + NER + Language model), which strikes me as odd. Information about labeled parts of speech should help with semantic role labeling, but the experiments did not show any evidence of this. This might be due to the additional low-dimensional positional embedding features for each word that were part of the SRL task model, though the authors did not report results for an SRL model without these features.

There are plenty of analogs to our own work on ChIP-seq embedding combined with accessibility features from ATAC-seq, but more on this later. For now, I've got more NLP papers to read, digest, and blog about.