Exploring cross-lingual learning techniques for advancing Tshivenda NLP coverage

dc.contributor.advisor: Marivate, Vukosi
dc.contributor.coadvisor: Mazarura, Jocelyn
dc.contributor.postgraduate: Nemakhavhani, Ndamulelo
dc.date.accessioned: 2024-09-13T12:01:23Z
dc.date.available: 2024-09-13T12:01:23Z
dc.date.created: 2024-04
dc.date.issued: 2023-06
dc.description: Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2023. [en_US]
dc.description.abstract: The information age has been a critical driver of the impressive advancement of Natural Language Processing (NLP) applications in recent years. The benefits of these applications have been most prominent in populations with relatively better access to technology and information. By contrast, low-resourced regions such as South Africa have seen a lag in NLP advancement due to the limited high-quality datasets required to build reliable NLP models. To address this challenge, recent NLP research has emphasised advancing language-agnostic models to enable Cross-Lingual Language Understanding (XLU) through cross-lingual transfer learning. Several empirical results have shown that XLU models work well when applied to languages with sufficient morphological or lexical similarity. In this study, we sought to exploit this capability to improve Tshivenda NLP representation using Sepedi and other related Bantu languages with relatively more data resources. Current state-of-the-art cross-lingual language models such as XLM-RoBERTa are trained on hundreds of languages, most of them high-resourced languages of European origin. Although the cross-lingual performance of these models is impressive for popular African languages such as Swahili, there is still considerable room for improvement. As the size of such models continues to soar, questions have been raised about whether competitive performance can still be achieved with downsized training data, to minimise the environmental impact of ever-increasing computational requirements. Fortunately, practical results from AfriBERTa, a multilingual language model trained on a 1GB corpus covering eleven African languages, showed that this could be a tenable approach to addressing the lack of representation for low-resourced languages in a sustainable way. Inspired by these recent successes, including XLM-RoBERTa and AfriBERTa, we present Zabantu-XLM-R, a novel fleet of small-scale, cross-lingual, pre-trained language models aimed at enhancing NLP coverage of Tshivenda. Although the study focused solely on Tshivenda, the presented methods can easily be adapted to other under-represented languages in South Africa, such as Xitsonga and isiNdebele. The language models were trained on different sets of South African Bantu languages, with each set chosen heuristically based on its similarity to Tshivenda. We used a novel news headline dataset annotated following the International Press Telecommunications Council (IPTC) standards to conduct an extrinsic evaluation of the language models on a short-text classification task. Our custom language models achieved an impressive average weighted F1-score of 60% in few-shot settings with as few as 50 examples per class from the target language. We also found that open-source models such as AfriBERTa and AfroXLMR exhibited similar performance, even though Tshivenda and Sepedi had minimal representation in their pre-training corpora. These findings validated our hypothesis that we can leverage the relatedness among Bantu languages to develop state-of-the-art NLP models for Tshivenda. To our knowledge, no similar work has been carried out focusing solely on few-shot performance on Tshivenda. [en_US]
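To make the few-shot evaluation setup described in the abstract concrete, the sketch below is a minimal illustration (not the authors' code) of fine-tuning an XLM-RoBERTa-style encoder on roughly 50 labelled news headlines per class and reporting a weighted F1-score, using the Hugging Face transformers/datasets libraries and scikit-learn. The checkpoint name, column names, label count, and hyperparameters are illustrative assumptions, not values taken from the dissertation.

# Minimal sketch of a few-shot headline-classification evaluation
# (illustrative only; model name, columns, labels and hyperparameters are assumptions).
from datasets import Dataset
from sklearn.metrics import f1_score
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

MODEL_NAME = "xlm-roberta-base"   # assumed stand-in for a Zabantu-XLM-R or AfroXLMR checkpoint
NUM_LABELS = 10                   # illustrative number of IPTC topic classes

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=NUM_LABELS)

def tokenize(batch):
    # Headlines are short, so a small max_length keeps few-shot training cheap.
    return tokenizer(batch["text"], truncation=True, max_length=64)

# Placeholder rows: in practice, load ~50 labelled Tshivenda headlines per class
# for training and a held-out labelled split for evaluation.
train_ds = Dataset.from_dict({"text": ["placeholder headline"], "label": [0]}).map(tokenize, batched=True)
test_ds = Dataset.from_dict({"text": ["placeholder headline"], "label": [0]}).map(tokenize, batched=True)

def compute_metrics(eval_pred):
    preds = eval_pred.predictions.argmax(axis=-1)
    # Weighted F1 mirrors the "average weighted F1-score" reported in the abstract.
    return {"weighted_f1": f1_score(eval_pred.label_ids, preds, average="weighted")}

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="fewshot-tshivenda", num_train_epochs=10,
                           per_device_train_batch_size=16),
    train_dataset=train_ds,
    eval_dataset=test_ds,
    tokenizer=tokenizer,          # enables dynamic padding of the tokenised headlines
    compute_metrics=compute_metrics,
)
trainer.train()
print(trainer.evaluate())         # reports eval_weighted_f1 on the held-out split

In a setup like this, one would swap in the pre-trained Zabantu-XLM-R, AfriBERTa, or AfroXLMR checkpoint and the IPTC-annotated Tshivenda headline split to obtain a comparison along the lines described in the abstract.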
dc.description.availability: Unrestricted [en_US]
dc.description.degree: MIT (Big Data Science) [en_US]
dc.description.department: Computer Science [en_US]
dc.description.faculty: Faculty of Engineering, Built Environment and Information Technology [en_US]
dc.identifier.citation: * [en_US]
dc.identifier.other: A2024 [en_US]
dc.identifier.uri: http://hdl.handle.net/2263/98198
dc.language.iso: en [en_US]
dc.publisher: University of Pretoria
dc.rights: © 2021 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
dc.subject: UCTD [en_US]
dc.subject: Natural Language Processing (NLP) [en_US]
dc.subject: Tshivenda NLP coverage [en_US]
dc.subject: Cross-Lingual Learning Techniques [en_US]
dc.subject: Low-resource NLP [en_US]
dc.subject: XLM-RoBERTa [en_US]
dc.title: Exploring cross-lingual learning techniques for advancing Tshivenda NLP coverage [en_US]
dc.type: Mini Dissertation [en_US]

Files

Original bundle

Name: Nemakhavhani_Exploring_2023.pdf
Size: 16.3 MB
Format: Adobe Portable Document Format
Description: Mini Dissertation

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission
Description: