Overcoming Catastrophic Forgetting During Domain Adaptation of Neural Machine Translation

Continued training is an effective method for domain adaptation in neural machine translation. However, in-domain gains from adaptation come at the expense of general-domain performance. In this work, we interpret the drop in general-domain performance as catastrophic forgetting of general-domain knowledge. To mitigate it, we adapt Elastic Weight Consolidation (EWC), a machine learning method for learning a new task without forgetting previous tasks. Our method retains the majority of general-domain performance lost in continued training without degrading in-domain performance, outperforming the previous state-of-the-art. We also explore the full range of general-domain performance available when some in-domain degradation is acceptable.


Introduction
Neural Machine Translation (NMT) performs poorly without large training corpora (Koehn and Knowles, 2017). Domain adaptation is required when there is sufficient data in the desired language pair but insufficient data in the desired domain (the topic, genre, style or level of formality). This work focuses on the supervised domain adaptation problem where a small in-domain parallel corpus is available for training. Continued training (Luong and Manning, 2015; Sennrich et al., 2015) (also called fine-tuning), where a model is first trained on general-domain data and then domain adapted by training on in-domain data, is a popular approach in this setting as it leads to empirical improvements in the targeted domain.
One downside of continued training is that the adapted model's ability to translate general-domain sentences is severely degraded during adaptation (Freitag and Al-Onaizan, 2016). We interpret this drop in general-domain performance as catastrophic forgetting (Goodfellow et al., 2013) of general-domain translation knowledge. Degradation of general-domain performance may be problematic when the domain-adapted NMT system is used to translate text outside its target domain, which can happen if there is a mismatch between the data available for domain-specific training and the test data. Poor performance may also concern end users of these MT systems who are expecting good performance on 'easy' generic sentences.¹
Elastic Weight Consolidation (EWC) (Kirkpatrick et al., 2017) is a method for training neural networks to learn a new task without forgetting previously learned tasks. We extend EWC to continued training in NMT (see §3): our first task is to translate general-domain sentences, and our second is to translate domain-specific sentences (without forgetting how to translate general-domain sentences). EWC works by adding a per-parameter regularizer, based on the Fisher Information matrix, while training on the second task. At a high level, the regularization term keeps parameters which are important to general-domain performance close to the initial general-domain model values during continued training, while allowing parameters less important to general-domain performance to adapt more aggressively to the in-domain data.
We show that when adapting general-domain models to the domain of patents, EWC can substantially improve the retention of general-domain performance (up to 18.1 BLEU) without degrading in-domain translation quality. Our proposed method outperforms the previous state-of-the-art method (Dakwale and Monz, 2017) at retaining general-domain performance while adapting to a new domain.
1 See Cadwell et al. (2018) and Porro Rodriguez et al. (2017) for discussions about lack of trust in MT.

Related Work
A few prior studies address the drop in general-domain NMT performance during continued training. Freitag and Al-Onaizan (2016) found that ensembling general- and in-domain models provides most of the in-domain gain from continued training while retaining most of the general-domain performance. Ensembling doubles memory and computational requirements at translation time, which may be impractical for some applications and does not address our more fundamental goal of building a single model that is robust across domains. Chu et al. (2017) found that mixing general-domain data with the in-domain data used for continued training improved general-domain performance of the resulting models, at the expense of training time. Dakwale and Monz (2017) share our goal of improving the general-domain performance of continued training. They introduce two novel approaches which use the initial, general-domain model to supervise the in-domain model during continued training. The first, multi-objective fine-tuning, which they denote MCL, trains the network with a joint objective of standard log-likelihood loss plus a second term based on knowledge distillation (Hinton et al., 2015; Kim and Rush, 2016) of the general-domain model. The second, multiple-output layer fine-tuning, adds new parameters to the output layer during continued training that are specific to the new domain. They found both methods performed similarly, significantly outperforming ensembling in the more challenging case where domain shift is significant, so we select the simpler MCL as our baseline.
We do not assume that the domain of input sentences is known, thus we do not compare to methods such as LHUC (Vilar, 2018). Our work applies a regularization term to continued training, similar to Miceli Barone et al. (2017) and Khayrallah et al. (2018), but for the purpose of retaining general-domain performance as opposed to improving in-domain performance.

Method
Compared to Kirkpatrick et al. (2017), we present a more general derivation of EWC to address the fact that our tasks are not independent. We also show that the diagonal of the Fisher matrix used in EWC is intractable to compute for sequence-to-sequence models with large vocabularies. Instead we propose to approximate it with the diagonal of the empirical Fisher (Martens, 2014), which can be computed efficiently using gradients from back-propagation.
At a high level, our method works as follows:
1. Train on the general-domain data, resulting in parameters $\theta^G$.
2. Compute the diagonal of the empirical Fisher matrix $\bar{F}$. $\bar{F}_{i,i}$ estimates how important the $i$-th parameter $\theta^G_i$ is to the general-domain translation task.
3. Initialize parameters to $\theta^G$ and train on in-domain data, using an EWC regularization term which incorporates the diagonal of $\bar{F}$.
Intuitively, the regularization term during continued training keeps a parameter $\theta_i$ close to the corresponding general-domain parameter $\theta^G_i$ if the model's general-domain performance is sensitive to that parameter (i.e., large $\bar{F}_{i,i}$). Parameters to which general-domain performance is less sensitive (i.e., small $\bar{F}_{i,i}$) are allowed to be updated more aggressively to fit the in-domain data.
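The intuition above can be illustrated on a two-parameter toy problem. The following is a minimal sketch, not the authors' implementation: the quadratic in-domain loss, the Fisher values, and the function names (`ewc_gradient_descent`, `toy_in_domain_grad`) are all assumptions made for illustration. The parameter with a large Fisher value stays near its general-domain value, while the parameter with a small Fisher value moves almost all the way to the in-domain optimum.

```python
def ewc_gradient_descent(grad_in_domain, theta_g, fisher_diag, lam, lr=0.05, steps=2000):
    """Gradient descent on: in-domain loss + lam * sum_i F_ii * (theta_i - theta_g_i)^2."""
    theta = list(theta_g)  # start from the general-domain parameters
    for _ in range(steps):
        g = grad_in_domain(theta)
        for i in range(len(theta)):
            # Gradient of the quadratic penalty for parameter i.
            penalty_grad = 2.0 * lam * fisher_diag[i] * (theta[i] - theta_g[i])
            theta[i] -= lr * (g[i] + penalty_grad)
    return theta


def toy_in_domain_grad(theta):
    """Gradient of a hypothetical in-domain loss 0.5 * sum_i (theta_i - 1)^2."""
    return [t - 1.0 for t in theta]


theta_g = [0.0, 0.0]          # hypothetical general-domain solution
fisher_diag = [10.0, 0.01]    # parameter 0 is important to the general domain, parameter 1 is not
theta = ewc_gradient_descent(toy_in_domain_grad, theta_g, fisher_diag, lam=1.0)
print(theta)  # parameter 0 stays near 0, parameter 1 moves close to 1
```

For this toy loss the minimizer has the closed form $\theta_i = 1 / (1 + 2 \lambda F_{i,i})$, which makes the role of the per-parameter weighting explicit.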

Bayesian Rationale for EWC
For the following discussion, let $X$ be the set of all well-formed source sentences and $Y$ be the set of all possible sequences of target words. Training data $D$ consists of translations $(x, y)$. We assume $x \in X$ is drawn from a true underlying distribution of source sentences $Q_x$, and $y \in Y$ is drawn from a true conditional distribution of correct translations $Q_{y|x}$. Our model, parameterized by $\theta$, computes the conditional probability $P_{y|x} \triangleq P(y|x; \theta)$, which estimates $Q_{y|x}$. Our dataset $D$ is assumed to have come from two distinct tasks: general-domain translation with data $D_G$ and in-domain translation with data $D_S$ (domain-specific). Without loss of generality, applying Bayes' rule to $\log p(\theta|D)$ and simplifying gives:
$$\log p(\theta|D) = \log p(D_S|D_G, \theta) + \log p(\theta|D_G) - \log p(D_S|D_G) \tag{1}$$
We aim to maximize Equation 1 for $\theta$:
$$\arg\max_{\theta} \log p(\theta|D) = \arg\max_{\theta} \left\{ \log p(D_S|D_G, \theta) + \log p(\theta|D_G) \right\} \tag{2}$$
We model $p(\theta|D_G)$ as a multivariate Gaussian with mean $\theta^G$, obtained by training the network on $D_G$ with standard negative log likelihood (NLL) loss, and diagonal precision matrix (inverse of the covariance matrix) given by the diagonal of the Fisher Information Matrix $F$:
$$F = E_{x \sim Q_x} \left[ E_{y \sim P_{y|x}} \left[ \nabla \log p(y|x, \theta) \, \nabla \log p(y|x, \theta)^T \right] \right]$$
This is the expected variance of the likelihood function's gradient at $\theta$. The magnitude of $F_{i,i}$ indicates the model's sensitivity to parameter $\theta_i$ on the general-domain translation task. Note that the first expectation is taken with respect to the true distribution of $x$ and can be approximated by training samples. The second expectation is taken with respect to the model distribution $P_{y|x}$, which is impractical for a large sequence-to-sequence model as it requires summing over all possible output sequences.
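For readers who want the intermediate steps, the Bayes-rule decomposition and the Gaussian approximation can be written out as below. This is a sketch under the assumptions that $D$ is the union of $D_G$ and $D_S$ and that the Gaussian has mean $\theta^G$ and diagonal precision $F_{i,i}$, as described in the text; the constant $C$ and the factor $\tfrac{1}{2}$ are standard for a Gaussian log-density and are not taken from the paper.

```latex
\begin{align*}
p(\theta \mid D) &= p(\theta \mid D_S, D_G)
  = \frac{p(D_S \mid D_G, \theta)\, p(\theta \mid D_G)}{p(D_S \mid D_G)} \\
\log p(\theta \mid D) &= \log p(D_S \mid D_G, \theta) + \log p(\theta \mid D_G)
  - \log p(D_S \mid D_G) \\
\log p(\theta \mid D_G) &\approx -\frac{1}{2} \sum_i F_{i,i}\, (\theta_i - \theta^G_i)^2 + C
\end{align*}
```

The term $\log p(D_S \mid D_G)$ does not depend on $\theta$, so it can be dropped when maximizing. The last line shows why maximizing the Gaussian log-density is equivalent to minimizing a Fisher-weighted squared distance to $\theta^G$; the constant factor can be absorbed into the regularization weight.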
We approximate the true Fisher with the empirical Fisher $\bar{F}$ (Martens, 2014), where $y$ is not enumerated but fixed to be the training labels:
$$\bar{F} = E_{(x, y) \sim D_G} \left[ \nabla \log p(y|x, \theta) \, \nabla \log p(y|x, \theta)^T \right]$$
Thus we approximate maximizing $\log p(\theta|D_G)$ in Equation 2 by minimizing
$$\sum_i \bar{F}_{i,i} (\theta_i - \theta^G_i)^2$$
Note that the diagonal of $\bar{F}$ is easily computed from backpropagation gradients.
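To make the distinction between the two Fisher matrices concrete, the sketch below computes both diagonals for a binary logistic-regression model, where enumerating $y$ is trivial. The model and all function names are assumptions chosen for illustration; in NMT the inner sum over $y$ ranges over all output sequences, which is what makes the true Fisher impractical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def grad_log_lik(theta, x, y):
    """Gradient of log p(y|x; theta) for logistic regression with y in {0, 1}."""
    p1 = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
    return [(y - p1) * xi for xi in x]


def empirical_fisher_diag(theta, data):
    """Average squared gradient, with y fixed to the training label."""
    diag = [0.0] * len(theta)
    for x, y in data:
        for i, gi in enumerate(grad_log_lik(theta, x, y)):
            diag[i] += gi * gi
    return [d / len(data) for d in diag]


def true_fisher_diag(theta, data):
    """Average squared gradient, with y enumerated under the model distribution."""
    diag = [0.0] * len(theta)
    for x, _ in data:
        p1 = sigmoid(sum(t * xi for t, xi in zip(theta, x)))
        for y, p_y in ((1, p1), (0, 1.0 - p1)):
            for i, gi in enumerate(grad_log_lik(theta, x, y)):
                diag[i] += p_y * gi * gi
    return [d / len(data) for d in diag]


data = [([1.0, 2.0], 1), ([1.0, -1.0], 0)]  # hypothetical (features, label) pairs
print(empirical_fisher_diag([0.5, -0.5], data))
print(true_fisher_diag([0.5, -0.5], data))
```

The two estimates generally differ; they happen to coincide at $\theta = 0$ for this model because both labels are then equally likely.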

Approximating
Tasks are assumed to be independent in the original EWC work (Kirkpatrick et al., 2017)

EWC Loss
Combining the approximations above results in the EWC loss used in continued training:
$$L_{EWC}(\theta) = L^S_{NLL}(\theta) + \lambda \sum_i \bar{F}_{i,i} (\theta_i - \theta^G_i)^2 \tag{3}$$
where $L^S_{NLL}(\theta)$ is the standard NLL loss on $D_S$ and $\lambda$ is a hyper-parameter which weights the importance of the general-domain task. Note that the left-hand side of Equation 3 is still the loss over both the general- and in-domain translation tasks, but the right-hand side is based only on in-domain data. All information from the general-domain data has been collapsed into the second term, which is in the form of a regularizer.
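A minimal sketch of the loss described here, assuming parameters are held in flat Python lists and the in-domain NLL has already been computed elsewhere. The function names and the symbol `lam` for the regularization weight are illustrative assumptions, not taken from the authors' code.

```python
def ewc_penalty(theta, theta_g, fisher_diag):
    """Fisher-weighted squared distance between current and general-domain parameters."""
    return sum(f * (t - tg) ** 2 for f, t, tg in zip(fisher_diag, theta, theta_g))


def ewc_loss(nll_in_domain, theta, theta_g, fisher_diag, lam):
    """In-domain NLL plus the weighted EWC regularizer."""
    return nll_in_domain + lam * ewc_penalty(theta, theta_g, fisher_diag)


def ewc_penalty_grad(theta, theta_g, fisher_diag, lam):
    """Gradient of lam * ewc_penalty with respect to theta."""
    return [2.0 * lam * f * (t - tg) for f, t, tg in zip(fisher_diag, theta, theta_g)]


theta = [1.0, 2.0]
theta_g = [0.0, 2.5]
fisher_diag = [2.0, 4.0]
print(ewc_loss(1.5, theta, theta_g, fisher_diag, lam=0.1))
print(ewc_penalty_grad(theta, theta_g, fisher_diag, lam=0.1))
```

Because the regularizer only needs $\theta^G$ and the Fisher diagonal, no general-domain data has to be revisited during continued training.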

Experiments
Our general-domain training data is the concatenation of the parallel portions of the WMT17 news translation task (Bojar et al., 2017) and OpenSubtitles18 (Lison et al., 2018) corpora. For De↔En and Ru↔En, we use newstest2017 and the final 2500 lines of OpenSubtitles as our test set. We use newstest2016 and the penultimate 2500 lines of OpenSubtitles as the development set. For Zh↔En, we use the final and penultimate 4000 lines of the UN portion of the WMT data and the final and penultimate 2500 lines of OpenSubtitles as our test and development sets, respectively.
We use the World Intellectual Property Organization (WIPO) COPPA-V2 corpus (Junczys-Dowmunt et al., 2016) as our in-domain dataset. The WIPO data consist of parallel sentences from international patent application abstracts. WIPO De↔En data are large enough to train strong in-domain systems (Thompson et al., 2018), so we truncate to 100k lines to simulate a more interesting domain adaptation scenario.
We reserve 3000 lines each for in-domain development and test sets. We apply the Moses tokenizer (Koehn et al., 2007) and byte-pair encoding (BPE) (Sennrich et al., 2016). We train separate BPE models for the source and target languages, each with a vocabulary size of approximately 30k. BPE is trained on the out-of-domain corpus only and then applied to the training, development, and test data for both out-of-domain and in-domain datasets. Token counts for corpora are shown in Table 1.
We implemented⁵ both EWC and MCL in Sockeye (Hieber et al., 2017). To avoid floating point issues, we normalize the empirical Fisher diagonal to have a mean value of 1.0 instead of dividing by the number of sentences. For efficiency, we compute gradients for a batch of sentences prior to squaring and accumulating them. Fisher regularization is implemented as weight decay (towards $\theta^G$) in Adam (Kingma and Ba, 2014).
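The accumulation and normalization described above might look like the following sketch. It assumes the batch gradient is the sum of per-sentence gradients and represents each gradient as a flat list; in a real toolkit the batch gradient would come directly from backpropagation. The function name and toy values are hypothetical.

```python
def fisher_diag_from_batches(batches):
    """Approximate Fisher diagonal from squared batch gradients, normalized to mean 1.0.

    batches: list of batches, each a list of per-sentence gradient vectors.
    Assumes at least one non-zero gradient, so the mean is non-zero.
    """
    n_params = len(batches[0][0])
    acc = [0.0] * n_params
    for batch in batches:
        # One gradient per batch (sum over sentences), squared afterwards.
        batch_grad = [sum(g[i] for g in batch) for i in range(n_params)]
        for i in range(n_params):
            acc[i] += batch_grad[i] ** 2
    # Normalize to unit mean rather than dividing by the sentence count.
    mean = sum(acc) / n_params
    return [a / mean for a in acc]


toy_batches = [[[1.0, 0.0], [1.0, 2.0]], [[0.0, 2.0]]]  # made-up gradients
print(fisher_diag_from_batches(toy_batches))
```

Squaring the batch gradient is not identical to summing squared per-sentence gradients, so this is an approximation made for efficiency. Normalizing to unit mean only rescales the diagonal, and any such rescaling can be compensated by the regularization weight.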
Preliminary experiments in Ru→En found no meaningful difference in general-domain or in-domain performance when computing the diagonal of $\bar{F}$ on varying amounts of data ranging from 500k sentences to the full dataset. We also tried computing the diagonal of $\bar{F}$ on held-out data, as there is some evidence that estimating Fisher on held-out data reduces overfitting in natural gradient descent (Pascanu and Bengio, 2013). However, we again found no meaningful differences. All results presented herein estimate the diagonal of $\bar{F}$ on 500k training data sentences, which took less than an hour on a GTX 1080 Ti GPU.
5 github.com/thompsonb/sockeye_ewc
We use a two-layer LSTM network with hidden unit size 512. The general-domain models are trained with a learning rate of 3E-4. We use dropout (0.1) on both RNN inputs and states. We report lowercased BLEU computed with multi-bleu.perl. We use label smoothing (0.1) for all experiments except with MCL, because MCL explicitly regularizes the output distribution.
MCL uses an interpolation of the cross entropy between the output distribution of the model being trained and the general-domain model's output distribution (scaled by $\alpha$) and the standard training loss (scaled by $1 - \alpha$). For MCL, we do a grid search over learning rates ($10^{-4}$, $10^{-5}$, $10^{-6}$) and $\alpha$ values of (0.1, 0.3, 0.5, 0.7, 0.9). For EWC, we do a grid search over the same learning rates and weight decay values of ($10^{-2}$, $10^{-3}$, $10^{-4}$, $10^{-5}$).
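The interpolated MCL objective can be sketched per target token as below. This is an illustrative reading of the description in the text, not the implementation of Dakwale and Monz (2017); the function name and the toy distributions are assumptions.

```python
import math


def mcl_token_loss(student_probs, teacher_probs, gold_index, alpha):
    """Interpolate distillation cross entropy (weight alpha) with standard NLL (weight 1 - alpha).

    student_probs: output distribution of the model being trained (all entries > 0).
    teacher_probs: output distribution of the frozen general-domain model.
    gold_index: index of the reference token in the vocabulary.
    """
    nll = -math.log(student_probs[gold_index])
    distill = -sum(q * math.log(p)
                   for q, p in zip(teacher_probs, student_probs) if q > 0.0)
    return (1.0 - alpha) * nll + alpha * distill


student = [0.7, 0.2, 0.1]   # made-up distributions over a 3-word vocabulary
teacher = [0.5, 0.4, 0.1]
print(mcl_token_loss(student, teacher, gold_index=0, alpha=0.3))
```

Setting `alpha` to 0 recovers the standard training loss, while larger values pull the adapted model's outputs toward those of the general-domain model.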

Results
We present the full in- and general-domain performance trade-off⁶ for both EWC and MCL in Figure 1. This is computed by taking the convex hull of a grid search over learning rate and regularization amount for each method. EWC outperforms MCL at all operating points with the exception of Ru→En, where MCL provides a small in-domain performance improvement at lower general-domain performance; this was also observed in Khayrallah et al. (2018). Figure 2 shows an example result (for En→Ru) of the grid search prior to taking the convex hull. We see similar trends between the three pairs of MCL/EWC curves at corresponding learning rates, but in each case EWC is further up/right, indicating better performance. Note that for both EWC and MCL, both learning rate and regularization amount have a large impact on final in- and general-domain performance.
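One way to extract such a trade-off curve from grid-search results is sketched below: it computes the upper convex hull of (in-domain BLEU, general-domain BLEU) points and keeps the part running from the best general-domain point to the best in-domain point. The function names and BLEU values are made up for illustration, and the exact procedure used for Figure 1 may differ.

```python
def cross(o, a, b):
    """Z-component of the cross product (a - o) x (b - o)."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def upper_right_hull(points):
    """Return the hull vertices forming the top-right frontier, sorted by in-domain score.

    points: iterable of (in_domain_bleu, general_domain_bleu) tuples.
    """
    pts = sorted(set(points))
    hull = []
    for p in pts:
        # Keep only clockwise turns so the chain is the upper hull.
        while len(hull) >= 2 and cross(hull[-2], hull[-1], p) >= 0:
            hull.pop()
        hull.append(p)
    # Drop vertices left of the best general-domain point.
    top = max(range(len(hull)), key=lambda i: (hull[i][1], hull[i][0]))
    return hull[top:]


grid_results = [(5, 20), (10, 30), (15, 25), (18, 22), (20, 28), (25, 20), (26, 10)]
print(upper_right_hull(grid_results))
```

Each vertex on the returned frontier corresponds to one hyper-parameter setting; points below the frontier are settings for which some other setting, or an interpolation between two settings, is at least as good in both domains.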
General-domain gains for no in-domain performance degradation are presented in Table 2. Our method provides large general-domain gains (between 8.0 and 18.1 BLEU), regaining the majority of general-domain performance lost in continued training and substantially outperforming MCL.
6 Previous work has compared single runs of competing methods, making comparison difficult (e.g. one system may be better on in-domain, the other better on general-domain).

Conclusion
We interpret the general-domain performance drop experienced during continued training as catastrophic forgetting of general-domain knowledge and demonstrate that it can be largely mitigated by applying Elastic Weight Consolidation.
We present the full trade-off for in- and general-domain performance and show that our method outperforms MCL (Dakwale and Monz, 2017) at all operating points in five of six language pairs. Our method is able to regain the majority of the general-domain performance lost during continued training without compromising in-domain performance and without an additional memory or computational burden at translation time.
Our method retains the advantages of continued training while addressing one of its main shortcomings and can be used in practical situations to avoid poor performance when general-domain input is encountered, even when in-domain performance and translation efficiency are both critical.

Figure 1: Performance trade-off for MCL and EWC: convex hull of grid search over learning rate and regularization amount. The x-axis is in-domain BLEU and the y-axis is general-domain BLEU, so the desired operating point is the top right corner. Initial general-domain model (GD) and continued training (CT) points are shown for comparison.

Figure 2: En→Ru results for various learning rates, for both MCL and EWC. Regularization amount increases from left to right for each trace. General-domain and continued training points are shown for reference.

Table 1: Number of English words in the training corpora.

Table 2: General-domain BLEU for: general-domain model prior to adaptation (GD), standard continued training (CT), and best performing MCL and EWC models with no in-domain degradation compared to CT (delta from CT). Best improvement over CT is bolded.