Theses and Dissertations (Computer Science)
Permanent URI for this collection: http://hdl.handle.net/2263/32356
Recent Submissions
Now showing 1 - 20 of 220
Item: Exploring cross-lingual learning techniques for advancing Tshivenda NLP coverage (University of Pretoria, 2023-06). Marivate, Vukosi; Mazarura, Jocelyn; Nemakhavhani, Ndamulelo
The information age has been a critical driver in the impressive advancement of Natural Language Processing (NLP) applications in recent years. The benefits of these applications have been prominent in populations with relatively better access to technology and information. On the contrary, low-resourced regions such as South Africa have seen a lag in NLP advancement due to the limited high-quality datasets required to build reliable NLP models. To address this challenge, recent NLP research has emphasised advancing language-agnostic models to enable Cross-Lingual Language Understanding (XLU) through cross-lingual transfer learning. Several empirical results have shown that XLU models work well when applied to languages with sufficient morphological or lexical similarity. In this study, we sought to exploit this capability to improve Tshivenda NLP representation using Sepedi and other related Bantu languages with relatively more data resources. Current state-of-the-art cross-lingual language models such as XLM-RoBERTa are trained on hundreds of languages, most of them high-resourced languages of European origin. Although the cross-lingual performance of these models is impressive for popular African languages such as Swahili, there is still plenty of room for improvement. As the size of such models continues to soar, questions have been raised on whether competitive performance can still be achieved with downsized training data, to minimise the environmental impact of ever-increasing computational requirements. Fortunately, practical results from AfriBERTa, a multilingual language model trained on a 1GB corpus from eleven African languages, showed that this could be a tenable approach to address the lack of representation for low-resourced languages in a sustainable way. Inspired by these recent triumphs in studies including XLM-RoBERTa and AfriBERTa, we present Zabantu-XLM-R, a novel fleet of small-scale, cross-lingual, pre-trained language models aimed at enhancing NLP coverage of Tshivenda. Although the study focused solely on Tshivenda, the presented methods can easily be adapted to other less-represented languages in South Africa, such as Xitsonga and isiNdebele. The language models have been trained on different sets of South African Bantu languages, with each set chosen heuristically based on its similarity to Tshivenda. We used a novel news headline dataset annotated following the International Press Telecommunications Council (IPTC) standards to conduct an extrinsic evaluation of the language models on a short-text classification task. Our custom language models showed an impressive average weighted F1-score of 60% in few-shot settings with as little as 50 examples per class from the target language. We also found that open-source models such as AfriBERTa and AfroXLMR exhibited similar performance, although Tshivenda and Sepedi had minimal representation in their pre-training corpora. These findings validated our hypothesis that the relatedness among Bantu languages can be leveraged to develop state-of-the-art NLP models for Tshivenda. To our knowledge, no similar work has been carried out solely focusing on few-shot performance on Tshivenda.
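As an illustration of the few-shot fine-tuning setting described above, the sketch below trains a generic cross-lingual encoder on a handful of labelled headlines with the Hugging Face transformers Trainer and reports a weighted F1-score. The base checkpoint (xlm-roberta-base), the toy headlines and the two classes are placeholder assumptions, not the actual Zabantu-XLM-R models or the IPTC-annotated dataset.

    import torch
    from torch.utils.data import Dataset
    from sklearn.metrics import f1_score
    from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                              Trainer, TrainingArguments)

    MODEL = "xlm-roberta-base"       # stand-in for a Zabantu-style checkpoint
    texts = ["mutambo wa bola mulovha", "muhasho wa mutakalo wo amba"]  # toy headlines
    labels = [0, 1]                  # hypothetical classes: 0 = sport, 1 = health

    tok = AutoTokenizer.from_pretrained(MODEL)

    class FewShotSet(Dataset):
        """Wraps a handful of labelled examples, as in a 50-shot split."""
        def __init__(self, texts, labels):
            self.enc = tok(texts, truncation=True, padding=True, return_tensors="pt")
            self.labels = labels
        def __len__(self):
            return len(self.labels)
        def __getitem__(self, i):
            item = {k: v[i] for k, v in self.enc.items()}
            item["labels"] = torch.tensor(self.labels[i])
            return item

    train = FewShotSet(texts, labels)
    model = AutoModelForSequenceClassification.from_pretrained(MODEL, num_labels=2)
    args = TrainingArguments(output_dir="out", num_train_epochs=5,
                             per_device_train_batch_size=8, report_to="none")
    trainer = Trainer(model=model, args=args, train_dataset=train)
    trainer.train()

    # Weighted F1, as reported above (evaluated here on the toy training set).
    preds = trainer.predict(train).predictions.argmax(-1)
    print("weighted F1:", f1_score(labels, preds, average="weighted"))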
Item: Learning industrial descriptions: NLP tasks for acronym expansion (University of Pretoria, 2024-02). Marivate, Vukosi; Johnson, Shaun
Human language is cryptic, since words can be interpreted differently depending on the context in which they occur. The exact meaning of a particular word in its context might be trivial for humans, who are generally unaware of language ambiguities; machines, on the other hand, are required to process, transform and analyse unstructured textual information to determine the underlying meaning. Acronyms are shortened versions of phrases; they save time and space, in both handwritten and typed text, compared with writing out their expansions or meanings. The main disadvantage of acronyms is confusion: if misunderstood, they can unknowingly cause damage, have a negative effect, or mislead the receiver, and the same acronym might not be appropriate for an audience in a different context. Solving acronym disambiguation could help reduce these negative effects. In this project we apply NLP technologies in a case study at an organisation in the Mining, Metals & Minerals (MMM) sector. In the MMM organisation, plant sensor tags (the acronyms) are derived by domain experts from technical programmable logic controller (PLC) names and expanded into pseudo-English (metallurgical) descriptions, these being the ground-truth expansions, to describe the sensors adequately for multiple stakeholders (including non-domain experts). Human input varies, leading to inconsistency in how tag names (acronyms) are coined, which in turn leads to varying degrees of uncertainty when deriving an accurate description (acronym expansion) from a tag. The aim of this research is to gauge to what extent transfer learning can be applied between similar domains using large language models; for example, scientific document understanding (SDU) could possibly explain some MMM acronyms. This leads us to the research question: can pre-trained NLP transformers be applied to the MMM industry, for which there are low-resource settings and few (or no) acronym dictionaries? We present transformers fine-tuned on SciAD/SDU that disambiguate acronyms within the SDU context very well and are a stepping stone towards use in the MMM domain in future. We foresee that there is still opportunity to unlock the benefits of other pre-trained language models (PLMs), and we note the value of using a small model for the MMM domain.
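To make the acronym-expansion task concrete, here is a naive candidate-matching baseline in the spirit of classic letter-matching heuristics: it proposes word sequences from the surrounding text whose initials spell the acronym. The example document is invented, and this heuristic is only a baseline sketch, not the SciAD/SDU fine-tuned transformers the thesis presents.

    import re

    def candidate_expansions(acronym, text):
        """Return word sequences from `text` whose initials spell `acronym`."""
        words = re.findall(r"[A-Za-z]+", text)
        n = len(acronym)
        hits = []
        for i in range(len(words) - n + 1):
            window = words[i:i + n]
            if all(w[0].lower() == c.lower() for w, c in zip(window, acronym)):
                hits.append(" ".join(window))
        return hits

    doc = ("The programmable logic controller streams tag values; each PLC name "
           "is mapped by an engineer to a plain metallurgical description.")
    print(candidate_expansions("PLC", doc))   # ['programmable logic controller']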
Item: Sentiment analysis using unsupervised learning for local government elections in South Africa (University of Pretoria, 2023-11). Marivate, Vukosi; Olaleye, Kayode; Matloga, Mokgadi Penelope
Understanding public sentiment is vital for political parties, so that they can structure their election campaigns around voter expectations. The study focuses on unsupervised learning to assess the variation of sentiment polarity in tweets during the 2021 South African local government election campaign. The study uses the pre-trained twitter-roberta-base-sentiment-latest model from Hugging Face and unsupervised lexicon-based approaches, namely VADER and TextBlob, to determine sentiment polarity, in order to gain insight that could inform political campaigns and to see whether there are any distinct sentiment patterns or shifts during different phases of the 2021 local government election campaign. Furthermore, the study applies suspicious-pattern and K-Means methods to classify users as either bots or humans, so as to identify the user behind the keyboard. The study also makes use of an OpenAI GPT model to label the dataset for fine-tuning, and addresses the issue of class imbalance. The VADER and TextBlob results show a significant difference from those of the twitter-roberta-base-sentiment-latest models when comparing the statistical distributions of the sentiment results and the user classification results. Based on the results, there is significant variation across all sentiment classes, and the classes vary over time. Furthermore, the results revealed that TRBSL and TRBSL** outperform VADER and TextBlob on weighted accuracy and F1-scores. It was discovered that most of the tweets were generated by humans, with only a few identified as bot-generated and carrying negative sentiment.
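The lexicon-based side of the comparison above is easy to reproduce. A minimal sketch scoring invented election-style tweets with VADER and TextBlob (the transformer model, GPT labelling and bot-detection steps are out of scope here):

    from vaderSentiment.vaderSentiment import SentimentIntensityAnalyzer
    from textblob import TextBlob

    tweets = ["Service delivery has improved in our ward!",
              "No water for three weeks and still no answers."]

    vader = SentimentIntensityAnalyzer()
    for t in tweets:
        compound = vader.polarity_scores(t)["compound"]   # in [-1, 1]
        blob = TextBlob(t).sentiment.polarity             # in [-1, 1]
        label = ("positive" if compound >= 0.05
                 else "negative" if compound <= -0.05 else "neutral")
        print(f"{label:8}  vader={compound:+.2f}  textblob={blob:+.2f}  {t}")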
Item: Analysing public transport user sentiment (University of Pretoria, 2024). Marivate, Vukosi; Abdulmumin, Idris; Myoya, Rozina L.
In many Sub-Saharan countries, the advancement of public transport is frequently overshadowed by more prioritised sectors, highlighting the need for innovative approaches to enhance both the Quality of Service (QoS) and the overall user experience. This research aimed to mine the opinions of commuters to shed light on the prevailing sentiments regarding public transport systems. Concentrating on the experiential journey of users, the study adopted a qualitative research design, utilising real-time data gathered from Twitter to analyse sentiments across three major public transport modes: rail, mini-bus taxis, and buses. By employing multilingual opinion mining techniques, the research addressed the challenges posed by linguistic diversity and potential code-switching in the dataset, showcasing the practical application of Natural Language Processing (NLP) in extracting insights from under-resourced language data. The primary contribution of this study lies in its methodological approach, offering a framework for conducting sentiment analysis on multilingual and low-resource languages within the context of public transport. The findings hold potential implications beyond the academic realm, providing transport authorities and policymakers with a methodological basis to harness technology in gaining deeper insights into public sentiment. By prioritising the analysis of user experiences and sentiments, this research provides a pathway for the development of more responsive, user-centred public transport systems in Sub-Saharan countries, thereby contributing to the broader objective of improving urban mobility and sustainability.

Item: Using NER and Doc2Vec to cluster South African criminal cases (University of Pretoria, 2021). Marivate, Vukosi; Nchachi, Carel Kagiso
The judicial system is the central pillar of law and order across the world. It is responsible for maintaining order amongst citizens and also solving litigations that arise. Although this system has worked quite well, several challenges still exist, such as racial biases in cases, a shortage of legal professionals and inconsistencies with regard to rulings in cases. These challenges need to be addressed in order to maintain law and order in society and to help strengthen the criminal justice system. Researchers have incorporated Natural Language Processing (NLP) techniques to help address some of these challenges, focusing primarily on three legal applications: Legal Judgment Prediction (LJP), Similar Case Matching (SCM) and Legal Question Answering (LQA) [28]. SCM focuses on identifying the relationships among cases using the available information; in other words, SCM is focused on segmenting or grouping legal cases. This is especially useful for common-law judicial systems, where judicial decisions are based on similar and representative cases that have happened in the past; South Africa uses this type of judicial system. Although good progress has been made in SCM applications, several challenges remain in these models, including using the entities found in a legal document to improve the matching of similar cases, and the interpretability of the models themselves. In this research we focus on applying SCM to South African criminal cases, by creating a model that matches similar crime cases together; this model also addresses the two challenges currently faced in SCM applications. We found that using a Named Entity Recognizer (NER) with a Paragraph Vector-Distributed Memory (PV-DM) model produced better results than a conventional PV-DM or TF-IDF model. This model also overcomes the current SCM challenges, as it uses the entities found in cases as the main variables for the model (via the NER model). Since the entities help explain how the model matched similar cases, the model is also interpretable. Based on the accuracy (similarity score) of the model, it can be used as a tool to segment criminal cases in practice.
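A small sketch of the core modelling idea above: represent each case by the entities mentioned in it and embed those entity sequences with gensim's Doc2Vec in PV-DM mode (dm=1), then cluster the case vectors. The per-case entity lists are invented stand-ins for NER output, and the tiny corpus is purely illustrative.

    from gensim.models.doc2vec import Doc2Vec, TaggedDocument
    from sklearn.cluster import KMeans

    # Hypothetical per-case entity lists standing in for NER output.
    cases = {
        "case_1": ["firearm", "robbery", "johannesburg", "vehicle"],
        "case_2": ["firearm", "robbery", "pretoria", "vehicle"],
        "case_3": ["fraud", "invoice", "tender", "company"],
    }
    docs = [TaggedDocument(words=w, tags=[k]) for k, w in cases.items()]

    # PV-DM mode is selected with dm=1.
    model = Doc2Vec(documents=docs, dm=1, vector_size=32, window=2,
                    min_count=1, epochs=200, seed=1)

    vectors = [model.dv[k] for k in cases]              # one vector per case
    groups = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(vectors)
    print(dict(zip(cases, groups)))                     # robbery cases should pair up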
Item: An investigation of the effectiveness of using Twitter data for predicting South African protests with Graph Neural Networks (University of Pretoria, 2024-04). Marivate, Vukosi; Ahmed, Maxamed; Ngomane, Derwin
Social media creates an echo chamber effect that is closely related to social movement theory, which aims to mobilise people to change society. In South Africa, there has been an increase in protests that appear to have started on social media; consider, for example, the riots that occurred in July 2021 following the arrest of former President Jacob Zuma. Protests in South Africa have at times culminated in violent incidents, such as the July 2021 protest. In that case, the South African Human Rights Commission found that social media sites such as WhatsApp, Facebook, and Twitter aided the violence through the sharing of protest information. This study investigates whether social media can be utilised to signal upcoming South African protests. Specifically, it investigates the effectiveness of noise reduction techniques on Twitter data for predicting protest-related events in South Africa using Graph Neural Networks. It addresses several research gaps: the need for graph-based methodologies in the South African context, the lack of noise-reduction research for Twitter data, and the use of an automated method to extract relevant keywords for the word networks. The work aims to provide a new avenue for noise reduction in real-world scenarios where future events have not yet occurred. This study examines a three-year data window between 2019 and 2021 using the Global Dataset of Events, Location, and Tone (GDELT) and Twitter data. GDELT contributes CAMEO codes related to protests and conflict, while Twitter contributes social media text from protest-related posts. A sliding window approach is used to combine the data, with noise-reduction techniques guiding the filtration. This work explores the potential of processing Twitter data to reveal signals for improved predictive capability; derivative metrics, from hashtags, links, and mentions, are used to reveal such signals. The study compares different machine learning methods, including Logistic Regression, Graph Convolutional Networks, and Graph Isomorphism Networks, to model the data. It is discovered that the geometric deep learning methods struggle with overfitting on hold-out testing data but are stable and have better cross-validation scores. The GIN model exhibits higher accuracy and isomorphism detection, making it suitable for the task. However, graph neural networks struggle with limited data and hence overfit the training data, and they also struggle with isomorphism and isolated nodes due to the message-passing paradigm. The intricacy of Twitter interactions and conversations is highlighted in this work, emphasising the need for future research in data processing and model building. The study excluded other data features, such as user interactions, that could add more information about the data space's complexity. Keyword selection was done independently, but node eigenvector centrality could be used for informed decision-making. The message-passing paradigm of graph neural networks has limited capability in the presence of isolated nodes, and isomorphism is crucial for network performance. Further research should investigate dynamic capabilities and edge weights in GIN networks.
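The "derivative metrics" idea can be sketched without the graph models: count hashtags, links and mentions per day and fit the logistic regression baseline mentioned above. The tweets, days and event labels below are invented; the thesis derives labels from GDELT and also trains GCN and GIN models, which this sketch omits.

    import pandas as pd
    from sklearn.linear_model import LogisticRegression

    tweets = pd.DataFrame({
        "day":  [1, 1, 2, 2, 3, 3],
        "text": ["march to #freezuma now!", "watch https://t.co/x @sabcnews",
                 "#shutdown everything tomorrow @news24", "quiet day in the city",
                 "no #protest today, just sports", "weekend plans https://t.co/y"],
    })
    # Derive per-tweet counts of hashtags, links and mentions via regex.
    for col, pattern in [("hashtags", r"#\w+"), ("links", r"https?://\S+"),
                         ("mentions", r"@\w+")]:
        tweets[col] = tweets["text"].str.count(pattern)

    daily = tweets.groupby("day")[["hashtags", "links", "mentions"]].sum()
    protest_next_day = [1, 1, 0]        # hypothetical labels (GDELT-style events)

    clf = LogisticRegression().fit(daily, protest_next_day)
    print(clf.predict(daily))           # in-sample sanity check only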
Item: Real-time task schedulability analysis via spotlight abstraction (University of Pretoria, 2023-12-08). Gruner, Stefan; Timm, Nils; Nxumalo, Madoda
The schedulability analysis of real-time systems is challenged by the state space complexity of such systems: it is difficult to develop concrete state space models that can be used to verify the schedulability of real-time systems. This thesis addresses the state space complexity problem by means of a new abstraction technique that enables automated and efficient verification of schedulability properties of real-time task sets. The technique is applied to task queues under the FIFO and EDF scheduling policies. The approach is based on the spotlight abstraction principle. The novel spotlight abstraction approach partitions the scheduler task queue into a 'spotlight' and a 'shade'. A small number of tasks that appear, up to a specified depth, at the front of the queue and will be executed in the near future are placed in the spotlight. The shade contains the remaining tasks at the back of the queue, which are executed only after the spotlight tasks have been processed. A timed automaton is generated from the spotlight to form an abstract model of the concrete system. The schedulability analysis of the spotlight is then performed, whereby the behaviour of the shade is summarised and the partial schedulability result for the spotlight tasks is saved for re-use in subsequent iterations. In each iteration, if the result is still inconclusive and tasks remain in the shade, more tasks are brought from the shade into the spotlight, with which the model checker can proceed. The iterations continue until a decisive schedulability result (yes or no) is obtained. A new tool, called TVMC, that implements the spotlight abstraction-based model checking approach has been developed. An experimental performance evaluation of the abstraction-based TVMC against model checking without abstraction is presented. Empirical results showed that the execution times of the abstraction-based TVMC were shorter than those of the approach without abstraction. Moreover, the spotlight abstraction TVMC handled larger task sets, whereas the non-abstraction case failed to verify task sets with sizes greater than six due to state explosion. This work also presents an experimental comparison of TVMC against established tools: Timestool and the Uppaal-based RTLib. TVMC was able to cope with the state explosion problem considerably better than Timestool and RTLib, since it was able to handle significantly larger task sets and to finish the analysis in a similar amount of time.
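A loose, illustrative sketch of the spotlight/shade partition on an EDF-ordered task queue: the first few tasks are kept for precise analysis while the shade is summarised, here merely by its total remaining execution time. This is an analogy to the abstraction step, not the timed-automaton construction or the TVMC tool itself.

    from dataclasses import dataclass

    @dataclass
    class Task:
        name: str
        exec_time: int
        deadline: int

    def spotlight_partition(queue, depth):
        """Split an EDF-ordered queue into a precise spotlight and a summarised shade."""
        ordered = sorted(queue, key=lambda t: t.deadline)      # EDF ordering
        spotlight, shade = ordered[:depth], ordered[depth:]
        shade_summary = sum(t.exec_time for t in shade)        # abstract the shade
        return spotlight, shade_summary

    tasks = [Task("t1", 3, 10), Task("t2", 2, 5), Task("t3", 4, 20), Task("t4", 1, 8)]
    spot, shade_load = spotlight_partition(tasks, depth=2)
    print([t.name for t in spot], "summarised shade load:", shade_load)
    # ['t2', 't4'] summarised shade load: 7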
Item: A Study of bi-space search for bin packing problems (University of Pretoria, 2024-03). Pillay, Nelishia; Nyathi, Thambo; Beckedahl, Derrick
Traditionally, search techniques explore a single space, namely the solution space, to find a solution to a discrete optimisation problem. However, as the field has developed, the effectiveness of working in alternative spaces (such as the heuristic space) has been demonstrated. In addition, the most effective search techniques are computationally expensive. More recently, exploring more than one space to solve a problem has been investigated; this research has involved searching the heuristic and solution spaces sequentially, or alternating the search between the two spaces. The first aim of this study is the introduction of the concept of concurrent bi-space search (CBS), which involves searching both the solution and heuristic spaces concurrently. It is anticipated that this will be more effective than searching a single space or searching both spaces sequentially. Previous work has shown that searching alternative spaces, like the heuristic space, is computationally expensive; furthermore, in an attempt to improve the quality of solutions found, computationally expensive approaches are used to explore the solution space. Thus, a secondary aim of this study is to use a computationally cheap search technique to concurrently search the solution and heuristic spaces. It is hypothesised that exploring both spaces concurrently will eliminate the need for computationally expensive techniques in the solution space while still producing solutions of good quality. While the concept of CBS can be applied to any discrete optimisation problem, this study is restricted to packing problems, specifically the one-, two- and three-dimensional bin packing problems (1BPP, 2BPP and 3BPP); the higher-dimensional BPPs are chosen to investigate the scalability of the approach. A simple local search is used to independently search the heuristic space (HSS) and the solution space (SSS), in order to obtain a baseline against which to compare the CBS approach, which also employs a local search to concurrently search the heuristic and solution spaces. Performance comparison of the three approaches (CBS, HSS and SSS) is conducted using three performance metrics, namely the number of bins, a measure of the total wasted space across the bins (i.e. the packing efficiency), and the computational time. For all three problem domains (1BPP, 2BPP and 3BPP), CBS outperforms both HSS and SSS in terms of the number of bins and the amount of wasted space. However, SSS has lower runtimes, with CBS having lower runtimes than HSS. These results are found to be statistically significant for the majority of the problem instances. When compared to previous bi-space search approaches, CBS is found both to produce better-quality solutions and to have faster average runtimes. The CBS approach is also compared to state-of-the-art techniques for 1BPP, 2BPP and 3BPP. CBS does not outperform the state-of-the-art techniques for the simpler 1BPP, but is found to be scalable to the more difficult 2BPP and 3BPP, with performance comparable to the state-of-the-art techniques, in some cases outperforming them.
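A toy rendition of the concurrent bi-space idea for 1BPP, under the assumption of two stand-in low-level heuristics (first fit and best fit): a single local-search move perturbs the solution space (the item order) and the heuristic space (which rule packs each item) at the same time.

    import random

    CAP = 10
    ITEMS = [7, 5, 4, 4, 3, 2, 2, 1]

    def first_fit(bins, item):
        for b in bins:
            if sum(b) + item <= CAP:
                b.append(item)
                return
        bins.append([item])

    def best_fit(bins, item):
        fits = [b for b in bins if sum(b) + item <= CAP]
        if fits:   # pick the bin left with the least residual capacity
            min(fits, key=lambda b: CAP - sum(b) - item).append(item)
        else:
            bins.append([item])

    HEURS = [first_fit, best_fit]

    def pack(order, heur_seq):
        bins = []
        for item, h in zip(order, heur_seq):
            h(bins, item)
        return bins

    random.seed(0)
    order = ITEMS[:]
    heurs = [random.choice(HEURS) for _ in ITEMS]
    best = pack(order, heurs)
    for _ in range(200):                       # concurrent bi-space local search
        new_order, new_heurs = order[:], heurs[:]
        i, j = random.sample(range(len(ITEMS)), 2)
        new_order[i], new_order[j] = new_order[j], new_order[i]   # solution-space move
        new_heurs[i] = random.choice(HEURS)                       # heuristic-space move
        cand = pack(new_order, new_heurs)
        if len(cand) <= len(best):
            order, heurs, best = new_order, new_heurs, cand
    print(len(best), best)                     # bin count and the packing found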
Item: Analysis of Catastrophic Interference with Application to Spline Neural Architectures (University of Pretoria, 2024-02-14). Bosman, Anna Sergeevna; Van Deventer, Heinrich Pieter
Continual learning is the sequential learning of different tasks by a machine learning model. Continual learning is known to be hindered by catastrophic interference or forgetting, i.e. rapid unlearning of earlier learned tasks when new tasks are learned. Despite their practical success, artificial neural networks (ANNs) are prone to catastrophic interference. This study analyses how gradient descent and overlapping representations between distant input points lead to distal interference and catastrophic interference. Distal interference refers to the phenomenon where training a model on a subset of the domain leads to non-local changes on other subsets of the domain. This study shows that uniformly trainable models without distal interference must be exponentially large. A novel antisymmetric bounded exponential layer B-spline ANN architecture named ABEL-Spline is proposed that can approximate any continuous function, is uniformly trainable, has polynomial computational complexity, and provides some guarantees for distal interference. Experiments are presented to demonstrate the theoretical properties of ABEL-Splines. ABEL-Splines are also evaluated on benchmark regression problems. It is concluded that the weaker distal interference guarantees in ABEL-Splines are insufficient for model-only continual learning. It is conjectured that continual learning with polynomial-complexity models requires augmentation of the training data or algorithm.

Item: Nash equilibria in generalised dining philosophers games (University of Pretoria, 2023). Timm, Nils; Goranko, Valentin; Van Rooyen, Johan Pieter
The Generalised Dining Philosophers Game (GDPG) consists of agents which must cooperate (or compete) for shared resources. As there are several cooperating agents, we can think of the GDPG as a multi-agent system. In such a system there are naturally some qualitative objectives, such as fairness and liveness, and quantitative objectives, where the agents seek to satisfy their goals as frequently as possible. The GDPG is represented as a concurrent game model, and the agents' objectives are represented by LTL[F] formulas. Some qualitative objectives represent the goals of the entire group, while the quantitative objectives represent the individual agents' goals and should be optimised. From this point, the LTL[F] model checking procedure is modified to produce an automaton-based algorithm which identifies a strategy profile that satisfies the qualitative objectives and is also a Nash equilibrium with respect to the agents' quantitative objectives. That is, at each configuration of the game, an action must be prescribed to each agent such that the collective objectives of the group are satisfied and no agent can unilaterally deviate in order to achieve a better outcome.

Item: Interpretable machine learning in natural language processing for misinformation data (University of Pretoria, 2022-11). Marivate, Vukosi; Nkalashe, Yolanda
The interpretability of models has been one of the focal research topics in the machine learning community, due to a rise in the use of black-box models and complex state-of-the-art models [6]. Most of these models are debugged through trial and error, based on end-to-end learning [7, 48]. This creates some uneasiness and distrust among the end-user consumers of the models, which has resulted in limited use of black-box models in disciplines where explainability is required [33]. However, the alternative "white-box models" come with a trade-off in accuracy and predictive power [7]. This research focuses on interpretability in natural language processing for misinformation data. First, we explore example-based techniques through prototype selection to determine whether we can observe any key behavioural insights from a misinformation dataset. We use four prototype selection techniques: clustering, Set Cover, MMD-critic, and influential examples. We analyse the quality of each technique's prototype set, and further process the two prototype sets of optimal quality for word analysis and linguistic characteristics, together with the LIME technique, for interpretability. Secondly, we examine whether there are any critical insights in the South African disinformation context.
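Of the four prototype-selection techniques listed above, the clustering variant is the simplest to sketch: embed the texts with TF-IDF, cluster them, and keep the text nearest each centroid as a prototype. The five snippets are invented; Set Cover, MMD-critic, influential examples and the LIME step are not shown.

    from sklearn.cluster import KMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import pairwise_distances_argmin_min

    texts = ["vaccine microchip claim debunked by scientists",
             "5g towers cause illness says viral post",
             "fact check: water crisis rumour is false",
             "celebrity death hoax spreads online again",
             "miracle cure shared widely in group chats"]

    X = TfidfVectorizer().fit_transform(texts)
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

    # The text nearest each cluster centroid serves as that cluster's prototype.
    idx, _ = pairwise_distances_argmin_min(km.cluster_centers_, X)
    print([texts[i] for i in idx])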
Item: South African isiZulu and siSwati news corpus creation, annotation and categorisation (University of Pretoria, 2022). Marivate, Vukosi; Adendorff, M.; Madodonga, Andani
South Africa has eleven official languages, and amongst the eleven, nine are local low-resourced languages. As a result, it is essential to build resources for these languages so that they can benefit from advances in the field of natural language processing. In this project, the focus was to create annotated datasets for the isiZulu and siSwati local languages based on news topic classification tasks, and to present the findings from baseline classification models. Due to the shortage of data for these local South African languages, the datasets that were created were augmented and oversampled to increase data size and overcome class imbalance. In total, four different classification models were used, namely Logistic Regression, Naive Bayes, XGBoost and LSTM. These models were trained on three different text representations, namely count vectors, TF-IDF vectors and word2vec embeddings. The results of this study showed that XGBoost, Logistic Regression and LSTM trained on word2vec embeddings performed better than the other combinations.
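A minimal baseline in the spirit of the study, pairing TF-IDF features with Logistic Regression; the headlines and labels are invented stand-ins (the actual corpora, augmentation and oversampling steps are not reproduced here):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    headlines = ["abadlali baphumelele emdlalweni omkhulu",      # toy sports headline
                 "uhulumeni wethula isabelomali esisha",         # toy politics headline
                 "iqembu lezemidlalo linqobe umqhudelwano",      # toy sports headline
                 "umongameli ukhulume ngenqubomgomo entsha"]     # toy politics headline
    labels = ["sports", "politics", "sports", "politics"]

    clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    clf.fit(headlines, labels)
    print(clf.predict(["umdlalo omkhulu webhola"]))              # toy prediction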
Item: BantuBERTa: using language family grouping in multilingual language modeling for Bantu languages (University of Pretoria, 2023). Marivate, Vukosi; Akinyi, Verrah; Parvess, Jesse
It was researched whether a multilingual Bantu pretraining corpus could be created from freely available data. To create the dataset, Bantu text was extracted from datasets freely available online (mainly from Hugging Face). The resulting multilingual language model (BantuBERTa) from this pretraining data proved to be predictive across multiple Bantu languages on a higher-order NLP task (NER) and on a simpler NLP task (classification). This shows that the dataset can be used for Bantu multilingual pretraining and transfer to multiple Bantu languages. Additionally, it was researched whether using this Bantu dataset could benefit transfer learning in downstream NLP tasks. BantuBERTa under-performed with respect to other models (XLM-R, mBERT, and AfriBERTa) benchmarked on MasakhaNER's Bantu language tests (Swahili, Luganda, and Kinyarwanda); however, it produced state-of-the-art results for the Bantu language benchmarks (Zulu and Lingala) in the African News Topic Classification dataset. It was surmised that the pretraining dataset size (which was 30% smaller than AfriBERTa's) and dataset quality were the main causes of the poor performance on the NER test. We believe this is a case-specific failure due to poor data quality, resulting from a pretraining dataset consisting mainly of web-scraped pages (the dataset consisted mainly of MC4 and CC100 Bantu text). However, on lower-order NLP tasks, like classification, pretraining on languages solely within the language family seemed to benefit transfer to other similar languages within the family. This potentially opens a method for effectively including low-resourced languages in low-level NLP tasks.

Item: Using distributed ledger technology for digital forensic investigation purposes on tendering projects (University of Pretoria, 2023-01-30). Venter, Hein; Ramazhamba, Pardon Takalani
The tendering system used by the South African Government is regarded as a central method used by the organs of state to procure goods and services, including delivering some services to citizens, with the aim of promoting social, industrial, or environmental policies. Some of these projects are distributed using a tendering system aimed at developing and empowering the surrounding communities. Hence, the tendering system used by these organs of state should be fair, transparent, competitive, cost-effective, equitable, and free from corruption. However, mismanagement of the tendering system might lead to interruption of operations, poor product quality, late service delivery, rising costs and, most importantly, fraud and corruption. The use of paperwork to share project information might also lead to mismanagement of a tendering project, because it creates opportunities for illicit altering of project information during the process. This might also affect the fairness, transparency, data integrity, and competitiveness of the tendering system used by South African local government. Additionally, investigating any fraudulent activity is nearly impossible with the current paper-based tendering system. The purpose of this study is to implement a Blockchain prototype that can be used to securely share project information with all the participants that have an interest in the tendering project. This Blockchain prototype is called the Share Tendering Project (ShareTendPro) network. Diagrams were used to visualise the design of the ShareTendPro network. The use of the ShareTendPro network will enable various participants to access project information in real time, giving them access to the entire project history regardless of their geographical location. Access to real-time data implies that the ShareTendPro network will also promote real-time auditing and digital forensic investigations, because both auditors and investigators will have access to the project information of interest in real time. Additionally, the project information stored within the ShareTendPro network can be regarded as credible digital evidence because it is immutable by default. Furthermore, the ShareTendPro network seeks to reduce human interaction (and hence human error) by automating some of the processes within the network, while ensuring that all data is stored in a digital forensically sound format.
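The tamper-evidence property that makes ledger records credible digital evidence can be shown with a stripped-down hash chain: each block stores the hash of its predecessor, so altering any stored tender record breaks verification. The real ShareTendPro network is a permissioned Blockchain; this toy chain only illustrates the principle.

    import hashlib
    import json

    def record_hash(block):
        payload = {"data": block["data"], "prev": block["prev"]}
        return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()

    def make_block(data, prev_hash):
        block = {"data": data, "prev": prev_hash}
        block["hash"] = record_hash(block)
        return block

    def chain_is_valid(chain):
        for prev, cur in zip(chain, chain[1:]):
            if prev["hash"] != record_hash(prev) or cur["prev"] != prev["hash"]:
                return False
        return chain[-1]["hash"] == record_hash(chain[-1])

    chain = [make_block("genesis", "0" * 64)]
    chain.append(make_block("tender T-01: bid documents received", chain[-1]["hash"]))
    chain.append(make_block("tender T-01: award decision recorded", chain[-1]["hash"]))
    print(chain_is_valid(chain))        # True: untouched history verifies

    chain[1]["data"] = "tender T-01: bid documents (altered)"   # illicit edit
    print(chain_is_valid(chain))        # False: the tampering is detectable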
Item: A study of pheromone maps for ant colony optimization hyper-heuristics (University of Pretoria, 2022). Pillay, Nelishia; Singh, Emilio
In recent years, there has been increasing development of hyper-heuristics in the field of combinatorial optimisation. Broadly speaking, the term hyper-heuristic refers to a technique or algorithm that aims to provide a more generalised solution to, usually, a combinatorial problem. Hyper-heuristics differ from other combinatorial solution methods by working primarily in the heuristic space, as opposed to the solution space, to create more generalisable solutions for problems. There are four types of hyper-heuristics: generation constructive (GC), generation perturbative (GP), selection constructive (SC) and selection perturbative (SP). Each type functions by either generating new heuristics or selecting which existing heuristics to apply to a problem. They are further delineated by whether the hyper-heuristic is constructive or perturbative, with the former building solutions from scratch and the latter modifying and refining existing solutions. Despite increasing research into hyper-heuristics, one area where research has been lacking is the use of ant algorithms by hyper-heuristics to drive the search through the heuristic space. While there have been some investigations into the employment of ant algorithms by hyper-heuristics, a comprehensive study of the use of ant algorithms, and in particular their central search mechanism, the pheromone map, has largely not been done. This research endeavours to investigate the use of ant algorithms by the four different types of hyper-heuristic to search the heuristic space. The goal is to improve the employment of ant algorithms by hyper-heuristics through a study of how the pheromone map can be used to explore the heuristic space. A general ant algorithm for searching the heuristic space (HACO) was presented and extended for each of the four types of hyper-heuristic. This investigation specifically focused on examining the impact that using different pheromone maps (1D, 2D and 3D) would have on the ant algorithms used by the hyper-heuristics. Furthermore, a hybrid algorithm (HACOH), one that combines multiple HACO algorithms with their pheromone maps, was presented to improve upon the use of the different pheromone maps. The proposed algorithms (HACO and HACOH) were evaluated in multiple problem domains for each of the four hyper-heuristic types. The SC and SP experiments were performed in the quadratic assignment problem (QAP) and movie scene scheduling problem (MSSP) domains; the GC experiments were conducted in the one-dimensional bin packing problem (1BPP) and MSSP domains; and the GP experiments were conducted in the capacitated vehicle routing problem (CVRP) and MSSP domains. These algorithms were assessed primarily in terms of optimality and generality, although consideration of runtimes and comparisons with existing heuristics was included as well. The results showed that there were statistically significant differences between the different pheromone maps when used in ant-based hyper-heuristics across a wide number of the problem domains. The only exception was the SC-MSSP experiments, where differences were observed but were not significant. In these experiments, at least one type of pheromone map emerged as suboptimal for use in the hyper-heuristic in the problem domain. It was not always the case that a single type of pheromone map would predominate over the others, but the results indicated clear delineations between better and worse pheromone maps to use in hyper-heuristics across the domain experiments. The HACOH algorithm showed some promise in the generation hyper-heuristics, in the 1BPP and MSSP domains, but was generally inferior to a non-hybrid HACO algorithm in the majority of the experiments, indicating that the hybrid algorithm is not universally superior to its non-hybrid counterparts. These results meet the research objectives of this thesis by showing, firstly, that ant algorithms can be employed successfully by all four types of hyper-heuristic. More importantly, the results showed that there are meaningful differences between the different pheromone maps in ant-based hyper-heuristics, and that choosing the optimal map for an ant-based hyper-heuristic depends on the problem domain, among other factors.
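A toy version of the HACO idea with a 1D pheromone map over low-level heuristics: ants sample heuristic sequences in proportion to pheromone, and the iteration-best ant reinforces its choices after evaporation. The heuristic names and the placeholder cost function are assumptions for illustration only.

    import random

    HEURISTICS = ["first_fit", "best_fit", "worst_fit"]
    pheromone = {h: 1.0 for h in HEURISTICS}            # the 1D pheromone map

    def sample_sequence(length):
        """Sample one heuristic per decision point, biased by pheromone."""
        weights = [pheromone[h] for h in HEURISTICS]
        return [random.choices(HEURISTICS, weights=weights)[0] for _ in range(length)]

    def cost(seq):
        """Placeholder objective: pretend worst_fit choices waste space."""
        return seq.count("worst_fit")

    random.seed(3)
    for _ in range(30):                                 # ACO iterations
        ants = [sample_sequence(8) for _ in range(10)]
        best = min(ants, key=cost)
        for h in pheromone:                             # evaporation
            pheromone[h] *= 0.9
        for h in best:                                  # reinforce best ant's choices
            pheromone[h] += 1.0 / (1 + cost(best))
    print(pheromone)                                    # worst_fit ends up weakest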
Item: Automated design of the deep neural network pipeline (University of Pretoria, 2021). Pillay, Nelishia; Gerber, Mia
Deep neural networks have been shown to be very effective for image processing and text processing. However, the big challenge is designing the deep neural network pipeline, as it is time-consuming and requires machine learning expertise. More and more non-experts are using deep neural networks in their day-to-day lives, but do not have the expertise to tune parameters and construct optimal deep neural network pipelines. AutoML has mainly focused on neural architecture design and parameter tuning, but little attention has been given to the optimal design of the deep neural network pipeline and all of its constituent parts. In this work a single-point hyper-heuristic (SPHH) was used to automate the design of the deep neural network pipeline. The SPHH constructed a deep neural network pipeline design by selecting techniques to use at the various stages of the pipeline, namely the preprocessing, feature engineering and augmentation stages, as well as selecting a deep neural network architecture and relevant hyper-parameters. This work also investigated transfer learning, by using a design created for one dataset as a starting point for the design process on a different dataset, and evaluated the effect thereof. The reusability of the designs themselves was also tested. The SPHH designed pipelines for both the image processing and text processing domains: image processing covered maize disease detection and oral lesion detection specifically, and text processing covered sentiment analysis and spam detection, with multiple datasets used for all the aforementioned tasks. The pipeline designs created by means of automated design were compared to manually derived pipelines from the literature for the given datasets. This research showed that automated design of a deep neural network pipeline using a single-point hyper-heuristic is effective: deep neural network pipelines designed by the SPHH are either better than or just as good as manually derived pipeline designs in terms of performance and application time. The results showed that the pipeline designs created by the SPHH are not reusable, as they do not provide performance comparable to the results achieved when a design is created specifically for a dataset. Transfer learning using the designed pipelines was found to produce results comparable to or better than those achieved when using the SPHH without transfer learning. Transfer learning is only effective when the correct target and source are chosen; for some target datasets, negative transfer occurs when certain datasets are used as the transfer learning source. Future work will include applying the automated design approach to more domains and making designs reusable. The transfer learning process will also be automated in future work, to ensure positive transfer occurs. The last recommendation for future work is to construct a pipeline for unsupervised deep neural network techniques instead of supervised deep neural network techniques.

Item: Ukhetho: A Text Mining Study of the South African General Elections (University of Pretoria, 2019). Marivate, Vukosi; Moodley, Avashlin
The elections in South Africa are contested by multiple political parties appealing to a diverse population that comes from a variety of socioeconomic backgrounds. As a result, a rich source of discourse is created to inform voters about election-related content. Two common sources of information that help voters with their decision are news articles and tweets; this study aims to understand the discourse in these two sources using natural language processing. Topic modelling techniques, Latent Dirichlet Allocation and Non-negative Matrix Factorization, are applied to digest the breadth of information collected about the elections into topics. The topics produced are subjected to further analysis that uncovers similarities between topics, links topics to dates and events, and provides a summary of the discourse that existed prior to the South African general elections. The primary focus is on the 2019 elections; however, election-related articles from 2014 and 2019 were also compared to understand how the discourse has changed.
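Both topic models named above are available in scikit-learn. A compact sketch on invented election snippets, fitting LDA on count vectors and NMF on TF-IDF vectors and printing the top words per topic:

    from sklearn.decomposition import NMF, LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

    docs = ["party promises jobs and houses before the vote",
            "load shedding dominates the election debate",
            "voters queue at stations as polls open nationwide",
            "manifesto launch focuses on jobs and land reform"]

    def top_words(model, names, k=3):
        return [[names[i] for i in comp.argsort()[-k:][::-1]]
                for comp in model.components_]

    counts = CountVectorizer(stop_words="english")
    lda = LatentDirichletAllocation(n_components=2, random_state=0)
    lda.fit(counts.fit_transform(docs))
    print("LDA:", top_words(lda, counts.get_feature_names_out()))

    tfidf = TfidfVectorizer(stop_words="english")
    nmf = NMF(n_components=2, random_state=0)
    nmf.fit(tfidf.fit_transform(docs))
    print("NMF:", top_words(nmf, tfidf.get_feature_names_out()))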
Item: Dynamic multi-objective optimization for financial markets (University of Pretoria, 2019). Helbig, Marde; Bosman, Anna Sergeevna; Atiah, Frederick Ditliac
The foreign exchange (Forex) market has a turnover of over 5 trillion USD per day. In addition, it is one of the most volatile and dynamic markets in the world: market conditions change every second. Algorithmic trading in financial markets has received a lot of attention in recent years; however, little of the literature has explored the applicability and performance of various dynamic multi-objective algorithms (DMOAs) in the Forex market. This dissertation proposes a dynamic multi-swarm multi-objective particle swarm optimization (DMS-MOPSO) algorithm to solve dynamic multi-objective optimization problems (DMOPs). To explore the performance and applicability of DMS-MOPSO, the algorithm is adapted for the Forex market. The dissertation also explores the performance of different variants of dynamic particle swarm optimization (PSO), namely the charged PSO (cPSO) and quantum PSO (qPSO), for the Forex market. However, since the Forex market is not only dynamic but also has different conflicting objectives, a single-objective optimization algorithm (SOA) might not yield profit over time; for this reason, the Forex market was defined as a multi-objective optimization problem (MOP). Moreover, maximizing profit in a financial time series, like Forex, with computational intelligence (CI) techniques is very challenging, and it is even more challenging to make a decision from the solutions of a MOP, as in automated Forex trading. This dissertation therefore also explores the effects of five decision models (DMs) on DMS-MOPSO and on three other state-of-the-art DMOAs, namely the dynamic vector-evaluated particle swarm optimization (DVEPSO) algorithm, the multi-objective particle swarm optimization algorithm with crowding distance (MOPSOCD) and the dynamic non-dominated sorting genetic algorithm II (DNSGA-II). The effects of constraint handling and of the knowledge-sharing approach amongst sub-swarms were explored for DMS-MOPSO. DMS-MOPSO is compared against other state-of-the-art multi-objective algorithms (MOAs) and dynamic SOAs. A sliding window mechanism is employed over different types of currency pairs. The focus of this dissertation is to optimize technical indicators so as to maximize profit and minimize transaction cost. The obtained results showed that both dynamic single-objective optimization (SOO) algorithms and dynamic multi-objective optimization (MOO) algorithms performed better than static algorithms on dynamic problems. Moreover, the results also showed that a multi-swarm approach for MOO can solve dynamic MOPs.
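The core PSO velocity/position update underlying the proposed optimiser can be sketched in a few lines. The sketch below is single-objective and single-swarm, tuning one moving-average window on an invented price series, so it omits the multi-swarm, multi-objective and dynamic machinery of DMS-MOPSO.

    import random

    prices = [1.10, 1.12, 1.11, 1.15, 1.14, 1.18, 1.16, 1.20, 1.19, 1.23]  # toy series

    def profit(window):
        """Toy objective: profit of a moving-average crossover rule."""
        w = max(2, min(int(round(window)), 5))
        total, entry = 0.0, None
        for t in range(w, len(prices)):
            ma = sum(prices[t - w:t]) / w
            if prices[t] > ma and entry is None:
                entry = prices[t]                    # buy signal
            elif prices[t] < ma and entry is not None:
                total += prices[t] - entry           # sell signal
                entry = None
        return total

    random.seed(1)
    swarm = [{"x": random.uniform(2, 5), "v": 0.0} for _ in range(10)]
    for p in swarm:
        p["bx"], p["bf"] = p["x"], profit(p["x"])
    gx = max(swarm, key=lambda p: p["bf"])["bx"]     # global best position

    for _ in range(50):
        for p in swarm:
            r1, r2 = random.random(), random.random()
            # Standard PSO update: inertia + cognitive + social components.
            p["v"] = 0.7 * p["v"] + 1.4 * r1 * (p["bx"] - p["x"]) + 1.4 * r2 * (gx - p["x"])
            p["x"] += p["v"]
            f = profit(p["x"])
            if f > p["bf"]:
                p["bx"], p["bf"] = p["x"], f
        gx = max(swarm, key=lambda p: p["bf"])["bx"]

    print("best window:", round(gx), "profit:", round(profit(gx), 3))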
Item: Genetic Programming Approach for Nonstationary Data Analytics (University of Pretoria, 2021-02-16). Pillay, Nelishia; Kuranga, Cry
Nonstationary data in which concept drift occurs is usually made up of different underlying data-generating processes. Therefore, if the existence of different segments in the dataset is not taken into consideration, the induced predictive model is distorted by past patterns. Thus, the challenge posed to a regressor is to select an appropriate segment, one that depicts the current underlying data-generating process, to be used in model induction. The proposed genetic programming approach for nonstationary data analytics (GPANDA) provides a piecewise nonlinear regression model for nonstationary data. GPANDA consists of three components: a dynamic differential-evolution-based clustering algorithm that splits the parameter space into subspaces resembling the different data-generating processes present in the dataset; a dynamic particle-swarm-optimization-based model induction technique that induces nonlinear models describing each generated cluster; and dynamic genetic programming that evolves model trees defining the boundaries of the nonlinear models, which are expressed as terminal nodes. If an environmental change is detected in a nonstationary dataset, the dynamic differential-evolution-based clustering algorithm re-clusters the data. For the clusters that change, the dynamic particle-swarm-optimization-based model induction approach adapts the nonlinear models or induces new models to create an updated genetic programming terminal set; the genetic programming then evolves a piecewise predictive model to fit the dataset. To evaluate the effectiveness of GPANDA, experimental evaluations were conducted on both artificial and real-world datasets. Two stock market datasets, as well as GDP and CPI data, were selected to benchmark the performance of the proposed model against the leading studies. GPANDA outperformed the genetic programming algorithms designed for dynamic environments and was competitive with state-of-the-art techniques.

Item: Informed e-Consent Framework for Privacy Preservation in South African Health Information Systems (University of Pretoria, 2020). Venter, Hein S.; Sibiya, George; Zazaza, Lelethu
The South African Constitution advocates the protection of personal information: everyone has the right to privacy. This includes the protection of special information that relates to, among other things, an individual's biometrics, health, religion, or sex life. This special information may be processed if it is necessary in law, if it is being processed for historical purposes, or if it has already been disseminated in public by the data subject. If the aforementioned conditions are not met, the processing of special information is prohibited, unless the data subject has provided consent. Given that health information is regarded as special information, consent must be obtained from the data subject before it is processed. If special information is accessed by unauthorised parties, it may influence decisions about the data subject's employment, access to credit, and education, and may even cause reputational or personal harm. This research proposes an e-consent management approach which preserves the privacy of health information. Privacy laws and guidelines, such as, but not limited to, the Protection of Personal Information Act and the General Data Protection Regulation, are used to develop a privacy-preserving e-consent model, architectural design and prototype.