South African isiZulu and siSwati news corpus creation, annotation and categorisation

Files

AndaniMadodonga_MiniDissertation_Final.pdf (5.07 MB)

Date

2022

Authors

Madodonga, Andani

Publisher

University of Pretoria

Abstract

South Africa has eleven official languages, nine of which are local low-resourced languages. It is therefore essential to build resources for these languages so that they can benefit from advances in natural language processing. This project created annotated datasets for two of them, isiZulu and siSwati, for the task of news topic classification, and presents the findings from baseline classification models. Because data for these local South African languages is scarce, the datasets were augmented and oversampled to increase their size and to counter class imbalance. Four classification models were used: logistic regression, Naive Bayes, XGBoost and LSTM. These models were trained on three text representations: count vectorizer, TF-IDF vectorizer and word2vec. The results showed that XGBoost, logistic regression and LSTM trained on word2vec representations performed better than the other combinations.
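The abstract names oversampling and count-based vectorisation without detail. As a rough illustration only, the sketch below shows one common reading of those two steps in plain Python: random oversampling that duplicates minority-class examples, followed by a bag-of-words count representation. The function names `oversample` and `count_vectorize`, the toy headlines and the labels are all hypothetical and are not taken from the dissertation, which may have used different tooling and settings.

```python
import random
from collections import Counter


def oversample(texts, labels, seed=0):
    """Duplicate random minority-class examples until all classes are equal in size."""
    rng = random.Random(seed)
    by_label = {}
    for text, label in zip(texts, labels):
        by_label.setdefault(label, []).append(text)
    target = max(len(items) for items in by_label.values())
    out_texts, out_labels = [], []
    for label, items in by_label.items():
        # Draw extra copies (with replacement) to reach the largest class size.
        extra = [rng.choice(items) for _ in range(target - len(items))]
        for text in items + extra:
            out_texts.append(text)
            out_labels.append(label)
    return out_texts, out_labels


def count_vectorize(texts):
    """Return a sorted vocabulary and one term-count vector per text."""
    vocab = sorted({tok for t in texts for tok in t.lower().split()})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for t in texts:
        row = [0] * len(vocab)
        for tok, count in Counter(t.lower().split()).items():
            row[index[tok]] = count
        vectors.append(row)
    return vocab, vectors


# Hypothetical toy data, invented for illustration (not from the thesis corpus).
texts = ["ibhola umdlalo iqembu", "umdlalo ibhola", "ipolitiki uhulumeni"]
labels = ["sports", "sports", "politics"]

balanced_texts, balanced_labels = oversample(texts, labels)
vocab, X = count_vectorize(balanced_texts)
```

After balancing, each class has the same number of examples, and `X` holds one count vector per text that could be passed to a classifier such as logistic regression.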

Description

Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2022.

Keywords

UCTD, South African Local Languages, Low Resources Languages, Data Augmentation, Topic Classification, Logistic regression

URI

http://hdl.handle.net/2263/92767

Collections

Theses and Dissertations (University of Pretoria)
Theses and Dissertations (Computer Science)
