South African isiZulu and siSwati news corpus creation, annotation and categorisation

dc.contributor.advisor: Marivate, Vukosi
dc.contributor.coadvisor: Adendorff, M.
dc.contributor.email: u18114564@tuks.co.za [en_US]
dc.contributor.postgraduate: Madodonga, Andani
dc.date.accessioned: 2023-10-09T08:01:33Z
dc.date.available: 2023-10-09T08:01:33Z
dc.date.created: 2023-04
dc.date.issued: 2022
dc.description: Mini Dissertation (MIT (Big Data Science))--University of Pretoria, 2022. [en_US]
dc.description.abstract: South Africa has eleven official languages, nine of which are local, low-resourced languages. It is therefore essential to build resources for these languages so that they can benefit from advances in natural language processing. This project created annotated datasets for two of these languages, isiZulu and siSwati, for the task of news topic classification, and presents findings from baseline classification models. Because data for these languages is scarce, the datasets were augmented and oversampled to increase their size and to reduce class imbalance. Four classification models were used: logistic regression, Naive Bayes, XGBoost and LSTM. These models were trained on three text representations: count vectorizer, TF-IDF vectorizer and word2vec. The results showed that XGBoost, logistic regression and LSTM trained on word2vec embeddings performed better than the other combinations. [en_US]
dc.description.availability: Unrestricted [en_US]
dc.description.degree: MIT (Big Data Science) [en_US]
dc.description.department: Computer Science [en_US]
dc.identifier.citation: * [en_US]
dc.identifier.other: A2023 [en_US]
dc.identifier.uri: http://hdl.handle.net/2263/92767
dc.language.iso: en [en_US]
dc.publisher: University of Pretoria
dc.rights: © 2021 University of Pretoria. All rights reserved. The copyright in this work vests in the University of Pretoria. No part of this work may be reproduced or transmitted in any form or by any means, without the prior written permission of the University of Pretoria.
dc.subject: UCTD [en_US]
dc.subject: South African Local Languages [en_US]
dc.subject: Low Resources Languages [en_US]
dc.subject: Data Augmentation [en_US]
dc.subject: Topic Classification [en_US]
dc.subject: Logistic regression [en_US]
dc.title: South African isiZulu and siSwati news corpus creation, annotation and categorisation [en_US]
dc.type: Mini Dissertation [en_US]

Files

Original bundle

Name: AndaniMadodonga_MiniDissertation_Final.pdf
Size: 5.07 MB
Format: Adobe Portable Document Format
Description: Mini Dissertation

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission