Exploring cross-lingual learning techniques for advancing Tshivenda NLP coverage

dc.contributor.advisor: Marivate, Vukosi
dc.contributor.coadvisor: Mazarura, Jocelyn
dc.contributor.postgraduate: Nemakhavhani, Ndamulelo
dc.date.accessioned: 2024-09-13T12:01:23Z
dc.date.available: 2024-09-13T12:01:23Z
dc.date.created: 2024-04
dc.date.issued: 2023-06
dc.description: Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2023. [en_US]
dc.description.abstract: The information age has been a critical driver of the impressive advancement of Natural Language Processing (NLP) applications in recent years. The benefits of these applications have been most prominent in populations with relatively better access to technology and information. By contrast, low-resourced regions such as South Africa have seen a lag in NLP advancement due to the limited high-quality datasets required to build reliable NLP models. To address this challenge, recent NLP research has emphasised advancing language-agnostic models to enable Cross-Lingual Language Understanding (XLU) through cross-lingual transfer learning. Several empirical results have shown that XLU models work well when applied to languages with sufficient morphological or lexical similarity. In this study, we sought to exploit this capability to improve Tshivenda NLP representation using Sepedi and other related Bantu languages with relatively more data resources. Current state-of-the-art cross-lingual language models such as XLM-RoBERTa are trained on hundreds of languages, most of them high-resourced languages of European origin. Although the cross-lingual performance of these models is impressive for popular African languages such as Swahili, there is still considerable room for improvement. As the size of such models continues to soar, questions have been raised about whether competitive performance can still be achieved with downsized training data, to minimise the environmental impact of ever-increasing computational requirements. Fortunately, practical results from AfriBERTa, a multilingual language model trained on a 1GB corpus covering eleven African languages, showed that this could be a tenable approach to addressing the lack of representation for low-resourced languages in a sustainable way. Inspired by these recent successes, including XLM-RoBERTa and AfriBERTa, we present Zabantu-XLM-R, a novel fleet of small-scale, cross-lingual, pre-trained language models aimed at enhancing NLP coverage of Tshivenda. Although the study focused solely on Tshivenda, the presented methods can easily be adapted to other under-represented languages in South Africa, such as Xitsonga and isiNdebele. The language models were trained on different sets of South African Bantu languages, with each set chosen heuristically based on its similarity to Tshivenda. We used a novel news headline dataset annotated following the International Press Telecommunications Council (IPTC) standards to conduct an extrinsic evaluation of the language models on a short-text classification task. Our custom language models achieved an impressive average weighted F1-score of 60% in few-shot settings with as few as 50 examples per class from the target language. We also found that open-source models such as AfriBERTa and AfroXLMR exhibited similar performance, even though Tshivenda and Sepedi had minimal representation in their pre-training corpora. These findings validated our hypothesis that we can leverage the relatedness among Bantu languages to develop state-of-the-art NLP models for Tshivenda. To our knowledge, no similar work has been carried out focusing solely on few-shot performance on Tshivenda. [en_US]
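To make the few-shot evaluation setup described in the abstract concrete, the sketch below is a minimal illustration (not the authors' code) of fine-tuning an XLM-RoBERTa-style encoder on roughly 50 labelled news headlines per class and reporting a weighted F1-score, using the Hugging Face transformers/datasets libraries and scikit-learn. The checkpoint name, column names, label count, and hyperparameters are illustrative assumptions, not values taken from the dissertation.

# Minimal sketch of a few-shot headline-classification evaluation
# (illustrative only; model name, columns, labels and hyperparameters are assumptions).
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "xlm-roberta-base"   # assumed stand-in for a Zabantu-XLM-R or AfroXLMR checkpoint
NUM_LABELS = 10                   # illustrative number of IPTC topic classes

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

def tokenize(batch):
    # Headlines are short, so a small max_length keeps few-shot training cheap.
    return tokenizer(batch["text"], truncation=True, max_length=64)

# Placeholder rows: in practice, load ~50 labelled Tshivenda headlines per class
# for training and a held-out labelled split for evaluation.
train_ds = Dataset.from_dict({"text": ["placeholder headline"], "label": [0]}).map(tokenize, batched=True)
test_ds = Dataset.from_dict({"text": ["placeholder headline"], "label": [0]}).map(tokenize, batched=True)

def compute_metrics(eval_pred):
    preds = eval_pred.predictions.argmax(axis=-1)
    # Weighted F1 mirrors the "average weighted F1-score" reported in the abstract.
    return {"weighted_f1": f1_score(eval_pred.label_ids, preds, average="weighted")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="fewshot-tshivenda", num_train_epochs=10,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,          # enables dynamic padding of the tokenised headlines
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())         # reports eval_weighted_f1 on the held-out split

In a setup like this, one would swap in the pre-trained Zabantu-XLM-R, AfriBERTa, or AfroXLMR checkpoint and the IPTC-annotated Tshivenda headline split to obtain a comparison along the lines described in the abstract.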
dc.description.availability: Unrestricted [en_US]
dc.description.degree: MIT (Big Data Science) [en_US]
dc.description.department: Computer Science [en_US]
dc.description.faculty: Faculty of Engineering, Built Environment and Information Technology [en_US]
dc.identifier.citation: * [en_US]
dc.identifier.other: A2024 [en_US]
dc.identifier.uri: http://hdl.handle.net/2263/98198
dc.language.iso: en [en_US]
dc.publisher: University of Pretoria
dc.rights: © 2021 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
dc.subject: UCTD [en_US]
dc.subject: Natural Language Processing (NLP) [en_US]
dc.subject: Tshivenda NLP coverage [en_US]
dc.subject: Cross-Lingual Learning Techniques [en_US]
dc.subject: Low-resource NLP [en_US]
dc.subject: XLM-RoBERTa [en_US]
dc.title: Exploring cross-lingual learning techniques for advancing Tshivenda NLP coverage [en_US]
dc.type: Mini Dissertation [en_US]

Files

Original bundle

Name: Nemakhavhani_Exploring_2023.pdf
Size: 16.3 MB
Format: Adobe Portable Document Format
Description: Mini Dissertation

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission
Description: