The impact of language family on D2T generation in under-resourced languages

Publication date

DOI

Document Type

Master Thesis

Collections


License

CC-BY-NC-ND

Abstract

This thesis examines the challenges and methods of generating text from structured data (RDF triples) for under-resourced languages, based on the WebNLG challenge. The central question is how much language families help the model generate text in the WebNLG target languages without prior examples (zero-shot). We work with limited resources: we utilize an already pre-trained encoder-decoder LLM, mT5-small, to test the hypothesis, and, given hardware limitations, we train only until performance plateaus. By applying further pre-training and testing different fine-tuning strategies, we aim to improve text coherence and fluency and to assess how well the model extracts the information from the RDF triples. As part of our ablation experiments, we pre-train and fine-tune separately to assess the impact of each on the D2T task. The experiments start with the simplest model, pre-trained on the OPUS-100 dataset and fine-tuned on the English WebNLG dataset. The pre-training recipe remains the same, but for the fine-tuning step, the WebNLG dataset is altered to include more linguistically diverse language samples. Lastly, we introduce an augmentation technique to alter the WebNLG dataset further and generate samples for all the related languages we target. Finally, the best fine-tuning strategy is applied to a clean mT5 model to assess the influence of the pre-training. Later, in the meta-experiments, we generate augmented data for the languages targeted in WebNLG, which takes the models out of the zero-shot setting. Across these experiments, we measure model performance with automatic metrics, a manual analysis, and a comparison with other models from the WebNLG 2023 challenge. For the automatic assessment, we use BLEU, ROUGE, METEOR, TER, chrF++, BERTScore, and PARENT to provide a more holistic view of our model's capabilities.
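The data-to-text setup described above takes sets of RDF triples as model input. A minimal sketch of how such triples are commonly linearised into a flat string for an encoder-decoder model such as mT5 (the helper name and the `<S>/<P>/<O>` marker scheme are illustrative assumptions, not the thesis's actual preprocessing):

```python
def linearize_triples(triples):
    """Flatten (subject, predicate, object) RDF triples into one
    input string an encoder-decoder model can consume."""
    parts = []
    for subj, pred, obj in triples:
        # Mark each triple component so the model can recover structure.
        parts.append(f"<S> {subj} <P> {pred} <O> {obj}")
    return " ".join(parts)

# Example WebNLG-style input.
triples = [
    ("Alan_Bean", "birthPlace", "Wheeler,_Texas"),
    ("Alan_Bean", "occupation", "Test_pilot"),
]
print(linearize_triples(triples))
# → <S> Alan_Bean <P> birthPlace <O> Wheeler,_Texas <S> Alan_Bean <P> occupation <O> Test_pilot
```

The linearised string would then be tokenised and fed to the model, which is trained to produce the fluent verbalisation of the triples.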
In short, this study aims to answer the following questions: (a) What is the influence of language families in a zero-shot setting? (b) Is further pre-training necessary, or does it yield diminishing returns? (c) Does fine-tuning with noisy data provide any benefit? Lastly, (d) how does our model compare with the other models in the WebNLG 2023 challenge?

Keywords

Citation