Optical character recognition and text cleaning in the indigenous South African languages

dc.contributor.author: Prinsloo, Danie J. (Daniel Jacobus), 1953-
dc.contributor.author: Taljard, Elsabe (Elizabeth)
dc.contributor.author: Goosen, Michelle
dc.contributor.email: danie.prinsloo@up.ac.za
dc.date.accessioned: 2023-10-18T12:25:36Z
dc.date.available: 2023-10-18T12:25:36Z
dc.date.issued: 2022
dc.description.abstract: This article represents follow-up work on unpublished presentations by the authors of text and corpus cleaning strategies for the African languages. In this article we provide a comparative description of cleaning of web-sourced and text-sourced material to be used for the compilation of corpora with specific attention to cleaning of text-based material, since this is particularly relevant for the indigenous South African languages. For the purposes of this study, we use the term “web-sourced material” to refer to digital data sourced from the internet, whereas “text-based material” refers to hard copy textual material. We identify the different types of errors found in such texts, looking specifically at typical scanning errors in these languages, followed by an evaluation of three commercially available Optical Character Recognition (OCR) tools. We argue that the cleanness of texts is a matter of granularity, depending on the envisaged application of the corpus comprised by the texts. Text corpora which are to be utilized for e.g. lexicographic purposes can tolerate a higher level of ‘noise’ than those used for the compilation of e.g. spelling and grammar checkers. We conclude with some suggestions for text cleaning for the indigenous languages of South Africa.
dc.description.department: African Languages
dc.description.librarian: am2023
dc.description.sponsorship: The South African Centre for Digital Language Resources (SADiLaR) and the National Research Foundation of South Africa.
dc.description.uri: http://spil.journals.ac.za
dc.identifier.citation: Prinsloo, D.J., Taljard, E., Goosen, M. 2022, 'Optical character recognition and text cleaning in the indigenous South African languages', Stellenbosch Papers in Linguistics Plus, vol. 64, pp. 165-187. DOI: 10.5842/64-1-867.
dc.identifier.issn: 1027-3417 (print)
dc.identifier.issn: 2223-9936 (online)
dc.identifier.other: 10.5842/64-1-867
dc.identifier.uri: http://hdl.handle.net/2263/92986
dc.language.iso: en
dc.publisher: Stellenbosch University, Library and Information Service
dc.rights: © 2021 The authors. This work is licensed under a Creative Commons Attribution 3.0 License.
dc.subject: Text cleaning
dc.subject: Scanning errors
dc.subject: Granularity of cleanness
dc.subject: Optical character recognition (OCR)
dc.subject: African languages
dc.subject: Corpus cleaning
dc.subject: Indigenous languages
dc.subject: South Africa (SA)
dc.subject.other: Humanities articles SDG-09
dc.subject.other: SDG-09: Industry, innovation and infrastructure
dc.title: Optical character recognition and text cleaning in the indigenous South African languages
dc.type: Article

Files

Original bundle

Name: Prinsloo_Optical_2022.pdf
Size: 862.19 KB
Format: Adobe Portable Document Format
Description: Article

License bundle

Name: license.txt
Size: 1.71 KB
Format: Item-specific license agreed upon to submission