In this blog post, we will delve into the details of recently proposed techniques for either generating or ensembling adapters for low-resource languages. This short post is my submission for the EMNLP 2021 PhD Course organized by the NLPnorth unit at the IT University of Copenhagen.

## Introduction

In recent work on zero-shot cross-lingual transfer, MMTs (massively multilingual transformers) such as mBERT or XLM-R represent the state of the art, thanks to their impressive pretraining on more than 100 languages. They do however suffer from the so-called “curse of multilinguality” (Arivazhagan et al., Conneau et al.), which means that a fixed-size MMT will decrease in performance as the number of covered languages increases. In fact, extending the language coverage of MMTs by full fine-tuning of all model parameters can lead to “catastrophic forgetting” of previously acquired knowledge.

One of the main ways to address this is to add new learnable weights called adapters in each transformer layer while keeping the original weights frozen during fine-tuning. Language-specific adapters, introduced in the MAD-X paper (Pfeiffer et al.), are infused with the knowledge of a language and used in combination with task-specific adapters, enabling zero-shot transfer of MMTs even for languages not seen during pretraining. This approach however introduces a few issues, in particular the need for one adapter per language and the inability to adapt to new languages at test time.

In this blog post, we will first briefly review adapters and examine the limitations of the MAD-X architecture. Afterwards, we will discuss two papers published at EMNLP 2021 which address those issues by presenting lighter architectures that can easily adapt to new languages, either by learning to contextually generate adapter parameters (MAD-G, Ansell et al.) or by properly ensembling the outputs of the language adapters (EMEA, Wang et al.).

#### A brief recap

Adapters are an alternative to regular fine-tuning of PLMs (pretrained language models) and they were introduced to address two main problems that arise in transfer learning scenarios:

- Catastrophic Interference in multi-task learning: sharing parameters between tasks can produce a deterioration of performance for a subset of tasks
- Catastrophic Forgetting in sequential fine-tuning: this happens when the process of fine-tuning on tasks (partially) erases the knowledge acquired in previous stages

By inserting task-specific weights inside each transformer layer, and freezing all the other weights in the architecture, the previous knowledge acquired during pretraining stays intact during fine-tuning. In the picture below, you can see the regular transformer layer on the left and the architecture with the adapters on the right, in which only adapter weights $$\phi$$ are fine-tuned.
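To make this concrete, here is a minimal sketch of a bottleneck adapter, assuming the usual down-projection / non-linearity / up-projection design with a residual connection; the dimensions and initialization are illustrative, not taken from any specific paper.

```python
import numpy as np

class BottleneckAdapter:
    """Minimal bottleneck adapter: down-project, apply a non-linearity,
    up-project, and add a residual connection. Only these weights (phi)
    would be trained; the surrounding transformer stays frozen."""

    def __init__(self, hidden_dim, bottleneck_dim, seed=0):
        rng = np.random.default_rng(seed)
        # phi: the only trainable parameters in adapter-based fine-tuning
        self.W_down = rng.normal(0, 0.02, (bottleneck_dim, hidden_dim))
        self.W_up = rng.normal(0, 0.02, (hidden_dim, bottleneck_dim))

    def __call__(self, h):
        # h: (hidden_dim,) output of a frozen transformer sub-layer
        z = np.maximum(self.W_down @ h, 0.0)  # down-projection + ReLU
        return h + self.W_up @ z              # up-projection + residual

adapter = BottleneckAdapter(hidden_dim=768, bottleneck_dim=48)
h = np.ones(768)
out = adapter(h)
```

Because of the residual connection, a freshly initialized adapter only slightly perturbs the frozen layer's output, which is what makes inserting it into a pretrained model safe.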

Adapter-based fine-tuning is much faster, due to the reduced number of learnable parameters, but adapters do make the overall architecture larger, leading to longer prediction times at inference. There are however ways to tackle this, for example AdapterDrop (Rücklé et al.), a technique that drops adapters at earlier layers at inference time. Adapters are also interchangeable, which means that you can swap adapters provided that you keep the same underlying architecture.

In general, there are two main types of adapters, depending on the type of knowledge infused into them: task-specific adapters (TAs) and language-specific adapters (LAs). The combination of the two was explored in the MAD-X paper, which introduced the following procedure for zero-shot cross-lingual scenarios:

1. Language adapter training: Train the language adapters for the language of interest and the source language (for example English) using the MLM training objective
2. Downstream task training: Stack a randomly initialized task adapter on top of the source language adapter. Then freeze all weights except those of the task adapter and train on data in the source language. You can see an example for NER training in the picture below.
3. Zero-shot transfer: Swap the source language adapter with the target language adapter and evaluate on the test set in the target language. In the picture below you can see the English adapter being replaced with the Quechuan one.
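The three steps can be sketched as simple adapter bookkeeping. The `Adapter` class below is a hypothetical stand-in for a trained module, and no actual training happens; the sketch only illustrates the stacking and swapping logic.

```python
# Sketch of the MAD-X recipe: stack a task adapter (TA) on a language
# adapter (LA), train only the TA, then swap LAs for zero-shot transfer.

class Adapter:
    def __init__(self, name, trainable=False):
        self.name, self.trainable = name, trainable

# Step 1: language adapters trained separately with the MLM objective
language_adapters = {"en": Adapter("LA-en"), "qu": Adapter("LA-qu")}

# Step 2: stack a fresh task adapter on top of the source LA; everything
# except the task adapter is frozen during downstream training
task_adapter = Adapter("TA-ner", trainable=True)
stack = [language_adapters["en"], task_adapter]

# Step 3: zero-shot transfer, replacing the English LA with the Quechuan one
stack[0] = language_adapters["qu"]
```

The key point is that the task adapter never sees the target language during training; only the language adapter underneath it changes at test time.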

The MAD-X approach is particularly useful to adapt MMTs to languages not seen during pretraining, while achieving good performance also on high-resource languages. There are however some downsides in using this method for cross-lingual transfer, which I will highlight in the next section.

The issues with MAD-X can be summarized as follows:

1. One LA per language, which implies that (a) the model size scales linearly with the number of languages and (b) monolingual data is needed for training the language adapter, which might not be available for truly low-resource languages
2. Inability to adapt to unseen languages at test time
3. LAs do not interact with each other, for example to deal with unseen languages

We now move to the description of two papers that try to tackle these issues by proposing different approaches.

## Generating adapters for unseen languages (MAD-G, Ansell et al.)

The idea behind MAD-G (multilingual adapter generation) is to learn a model which can contextually generate language adapters for unseen languages, using language embeddings based on typological features. The architecture is the same as in MAD-X, but the adapter weights are predicted using CPG (Contextual Parameter Generation), a general procedure used to predict the parameters $$\theta$$ of a neural network $$f_{\theta}$$ conditioned on some context, which in this case is the typological information of the language of interest. Having a single model that can infer the language-specific parameters for any language makes the approach lighter and quicker to adapt to unseen languages. We will first delve into the details of the CPG procedure (assuming training has already happened) and then move to the description of training and usage of the MAD-G models. The CPG procedure is divided into two phases: language embedding generation and adapter parameter generation.

#### Language embedding generation

For a given language $$l$$, the authors extract a sparse typological vector $$\mathbf{t}^{(l)}$$ containing $$289$$ binary typological features (103 syntactic, 28 phonological and 158 phonetic features) from the URIEL language typology database. A language embedding $$\mathbf{\lambda}^{(l)}$$ is obtained via a down-projection, $$\mathbf{\lambda}^{(l)} = V \, \mathbf{t}^{(l)}$$. In prior work, language embeddings were learned end-to-end and were thus unable to adapt to new languages introduced at test time. Here, however, the model can use the knowledge stored in the URIEL database to quickly adapt to a new language, addressing issue (2) raised above.

The language embedding provides the information needed to produce the language-specific parameterization $$\theta^{(l)}$$ of the adapters, using a simple learned linear projection $$\theta^{(l)} = W \, \mathbf{\lambda}^{(l)}$$. The matrix $$W$$ is shared across all languages, thus enabling knowledge sharing between them and addressing issue (3).
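Putting the two projections together, the whole generation step is just two matrix multiplications followed by a reshape. The sketch below assumes illustrative adapter dimensions (hidden size 768, bottleneck 48, embedding size 32) and random, untrained projections; only the 289-dimensional typological vector comes from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

hidden, bottleneck, emb_dim = 768, 48, 32
n_adapter_params = hidden * bottleneck * 2  # W_down and W_up of one adapter

# Shared, learned projections (random here, since training is omitted)
V = rng.normal(0, 0.02, (emb_dim, 289))               # typology -> embedding
W = rng.normal(0, 0.02, (n_adapter_params, emb_dim))  # embedding -> parameters

# Sparse binary typological vector t^(l) from URIEL for some language l
t = rng.integers(0, 2, 289).astype(float)

lam = V @ t      # language embedding lambda^(l)
theta = W @ lam  # language-specific adapter parameters theta^(l)

# Reshape the flat parameter vector into the adapter's weight matrices
W_down = theta[: hidden * bottleneck].reshape(bottleneck, hidden)
W_up = theta[hidden * bottleneck:].reshape(hidden, bottleneck)
```

Note that $$W$$ dominates the parameter count: it is shared across all languages, so adding a new language costs nothing beyond looking up its typological vector.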

The picture visually explains the process of generating the adapter parameters.

### Training the whole architecture

We have seen how CPG can infer the language adapter parameters conditioned on an external linguistic source. But how does it learn to do so and how is the overall architecture used in downstream tasks? The authors describe the overall procedure as divided into the following 3 main steps, summarized in the picture below:

1. Training the CPG generator: this is done simply by minimizing a multilingual MLM objective, while keeping the transformer layers frozen.
2. Downstream task fine-tuning: a task-specific adapter is placed on top of the generated source language adapter. The latter, together with the transformer layers, stays frozen during training.
3. Zero-shot transfer: replace the source-language adapter with the generated target language adapter and proceed with inference.
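As a quick summary of the three phases, the sketch below records which parameter groups are trainable in each one; the group names are illustrative, not taken from the paper's code.

```python
# Which parameter groups are updated in each MAD-G phase (illustrative
# names; "V" and "W" are the CPG projections from the previous section).

phases = {
    "1_cpg_mlm_training": {
        "trainable": ["V", "W"],          # the CPG generator itself
        "frozen": ["transformer"],
    },
    "2_task_fine_tuning": {
        "trainable": ["task_adapter"],
        "frozen": ["transformer", "V", "W"],  # generated LA stays fixed
    },
    "3_zero_shot_transfer": {
        "trainable": [],                  # inference only: swap generated LAs
        "frozen": ["transformer", "V", "W", "task_adapter"],
    },
}
```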

### Results

The authors test the MAD-G approach on part-of-speech (POS) tagging and dependency parsing (DP), achieving the best results for “MAD-G-seen” languages, i.e. languages the CPG module was trained on that were not included in the MMT pretraining. For unseen languages, MAD-G-based models performed best on languages belonging to the same genus as languages seen during CPG training. The main takeaway from the results is that it is difficult to improve the knowledge an MMT has of a language seen during pretraining, as using MAD-G there turned out not to be particularly beneficial.

An interesting point was raised by analyzing the multi-source transfer scenario, where the task adapter in step 2 of the procedure above is trained with multilingual data. Large gains were observed when increasing the number of languages (up to 20) while keeping the total number of samples constant. As you can see in the picture below, the gain is particularly large when switching from 1 to 2 languages.

The authors were also able to show the beneficial effect of MAD-G-initialized language embeddings over randomly initialized ones. To do this, they propose to initialize the adapters and then fine-tune them via MLM on monolingual data before the downstream task fine-tuning (step 2). This scenario, in which some unannotated data is available, is realistic even for truly low-resource languages. They simulated different resource-poverty scenarios by varying the amount of available data, as in the graph below: as you can see, the gap is particularly large for DP and does not shrink even when larger amounts of data are available.

## Ensembling adapters for low-resource languages (Wang et al.)

#### Naive and entropy minimized ensembling

In this paper by Wang et al., the main idea is to adapt the MAD-X architecture to new languages at test time, without changing the architecture or introducing new adapters. The first thing they empirically show is that using the language adapter of the language most related to the unseen test language gives poor performance overall.

To leverage the knowledge infused in the language adapters to deal with an unseen language, they propose two new techniques:

1. Adapter ensembling: this is a very simple strategy where $$h$$, the output of a generic transformer layer before the language adapters, is passed to each of the $$R$$ language adapters and the average is passed to the task adapter: $$h \mapsto \mathcal{L}_{avg}(h) = \frac{1}{R} \sum_{i=1}^{R} \mathcal{L}_{i}(h)$$

2. Entropy Minimized Ensemble of Adapters (EMEA): this approach addresses the major issue with naive ensembling, namely the equal weighting of the language adapters, by using a weighted average of their outputs: $$h \mapsto \mathcal{L}_{wavg}(h) = \sum_{i=1}^{R} \alpha_i \mathcal{L}_{i}(h), \quad \sum_{i=1}^{R} \alpha_i = 1, \; \alpha_i \geq 0$$ A visual comparison between the MAD-X architecture and EMEA is provided in the picture below.

Clearly, the $$\alpha_i$$’s should depend on the test samples: each sample should use a different spectrum of linguistic knowledge from the adapters, in particular from the adapter of the most related language. If we let $$\mathbf{\alpha}$$ be the vector of the $$\alpha_i$$’s, then a good $$\mathbf{\alpha}$$ for an input sample $$x$$ is one that makes the model more confident in its predictions, which means lowering the entropy $$H(x ; \mathbf{\alpha})$$ of its output distribution. To decrease the entropy by tweaking the $$\alpha_i$$’s, we can simply compute $$g_i = \nabla_{\alpha_i} H(x ; \mathbf{\alpha})$$ and update the weights with a simple gradient descent step $$\alpha_i = \alpha_i - \gamma g_i$$, where $$\gamma$$ is the learning rate. This procedure is iterated $$T$$ times before the final prediction for the test sample $$x$$ is produced.
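The procedure can be sketched end-to-end. In this toy version the adapters and classifier head are random linear maps, and the entropy gradient is estimated by finite differences rather than backpropagation; everything here is an illustrative assumption except the update rule itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

# Hypothetical stand-ins: R "language adapters" as linear maps plus a
# classifier head; in the real model each L_i is a bottleneck adapter
# inside every transformer layer.
R, hidden, n_classes = 4, 16, 5
adapters = [rng.normal(0, 0.3, (hidden, hidden)) for _ in range(R)]
W_out = rng.normal(0, 0.3, (n_classes, hidden))
h = rng.normal(0, 1, hidden)  # layer representation of one test sample

def output_entropy(alpha):
    # Weighted ensemble of adapter outputs, then classifier + softmax
    ensembled = sum(a * (A @ h) for a, A in zip(alpha, adapters))
    return entropy(softmax(W_out @ ensembled))

# Naive ensembling corresponds to uniform weights alpha_i = 1/R; EMEA
# starts there and takes T gradient steps on alpha to lower the entropy
# of the output distribution for this particular sample.
alpha = np.full(R, 1.0 / R)
gamma, T, eps = 0.01, 10, 1e-5
H_start = output_entropy(alpha)
for _ in range(T):
    H = output_entropy(alpha)
    # Finite-difference estimate of g_i = dH/dalpha_i (a sketch; in
    # practice this gradient is computed by backpropagation)
    g = np.array([(output_entropy(alpha + eps * np.eye(R)[i]) - H) / eps
                  for i in range(R)])
    alpha = alpha - gamma * g

H_end = output_entropy(alpha)
```

After the updates, the entropy of the output distribution is lower than with the uniform weights, i.e. the model has become more confident by leaning on the more relevant adapters for this sample.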

#### Results and overall considerations

The results for NER and POS tasks show large gains in particular for languages with non-Latin scripts, probably because these are the ones the model is most uncertain about. EMEA with only one update step ($$T=1$$) is able to surpass the naive ensembling, and with $$T=10$$ update steps the gap grows larger (at a higher computational cost).

This approach is able to solve issue (2) and issue (3), as the method is applied at test time with unseen languages and makes the different languages interact with each other to produce the output at each layer. The main benefit of this approach is that it requires minimal modifications to the MAD-X architecture, enabling it to adapt to new language scenarios. This however still implies having one adapter per language (issue (1)), which is something that CPG addresses directly by having a single model predict the adapter parameters for any language.

## Conclusions

In this blog post, we discussed some new approaches to cross-lingual transfer by using adapters: we focused on improvements over the MAD-X architecture, which needs one language adapter per language and is unable to adapt to new languages at test time.

The first approach, named MAD-G, learns to contextually generate the adapter parameters by leveraging the linguistic knowledge coming from an external source (URIEL). This makes the approach lighter, as a single model serves all languages, and lets it adapt quickly to unseen ones. Improving the knowledge an MMT has of a language seen during pretraining still seems to be quite difficult.

The second approach, named EMEA, uses the same architecture as MAD-X but learns at test time how to correctly weight the linguistic information in the language adapters for each test sample, by minimizing the entropy of its output distribution. The authors show that simply using the adapter of the most related language does not work, and that ensembling particularly helps languages with non-Latin scripts. This method however leads to longer inference times, due to the adaptation phase, which is performed via gradient updates for each test sample.

To sum up, adapters seem like a promising direction for cross-lingual transfer, as they allow for a simple solution to the “curse of multilinguality” and can be easily swapped in and out of the main architecture. These papers shed some light on how to make the architecture even lighter and with broader generalization capabilities, even though some questions are still unanswered, such as how to improve the knowledge of languages seen during the pretraining of MMTs.