Research Datasets

Access our publicly available research datasets for your projects

CC0 1.0

Dataset of Crops part one

This dataset contains the training-set images for five classes: cassava, sugarcane, maize, cashew, and coffee. It is the first part of a two-part dataset and covers five of the seven classes used in crop classification projects; the training images for the remaining two classes, weeds and unknown, are in the second part, which also contains the validation and test data. Combined, the two parts are prepared for classification and have three splits: train, validation, and test, with the train split augmented. There are 4074 images across all seven classes, with …

Available: January 03, 2024

Download link

CC0 1.0

Data of Crops Part Two

This is the second part of the dataset at https://doi.org/10.7910/DVN/J0OS9R. It contains the two remaining training-set classes, weeds and unknown, plus the validation and test data for all classes. To use it, combine it with the first part, which holds the other five training classes.

Available: January 03, 2024

Download link
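
Combining the two parts of the crops dataset can be sketched as below. This is a minimal sketch and not the authors' procedure: it assumes each downloaded part unpacks to a hypothetical layout of <part>/<split>/<class>/<image files>, and the function name combine_parts and all folder names are illustrative only. Adjust the glob pattern to the layout actually found in the archives.

```python
import shutil
from pathlib import Path


def combine_parts(part_one: Path, part_two: Path, out_dir: Path) -> dict[str, int]:
    """Copy every split/class folder from both parts into one tree.

    Assumed (hypothetical) layout: <part>/<split>/<class>/<image files>.
    Returns the number of files copied per "split/class" key.
    """
    counts: dict[str, int] = {}
    for part in (part_one, part_two):
        # Three path levels below the part root: split, class, image file.
        for image in sorted(part.glob("*/*/*")):
            if not image.is_file():
                continue
            split, cls = image.parent.parent.name, image.parent.name
            target_dir = out_dir / split / cls
            target_dir.mkdir(parents=True, exist_ok=True)
            target = target_dir / image.name
            if target.exists():
                # Refuse to silently overwrite an image with the same name.
                raise FileExistsError(f"duplicate file name: {target}")
            shutil.copy2(image, target)
            key = f"{split}/{cls}"
            counts[key] = counts.get(key, 0) + 1
    return counts
```

Calling combine_parts(Path("crops_part_one"), Path("crops_part_two"), Path("crops_combined")) with the real unpacked folders would give a single tree with train, validation, and test splits; the returned counts can be used to check that all seven classes are present in the train split.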

CC BY 4.0

Multilingual Parallel Text Corpora for East African Languages

This is a partial multilingual parallel corpus covering five East African languages. The dataset contains an English text corpus that has been translated into Acholi, Runyankore, Luganda, Lumasaba, and Swahili.

Available: December 05, 2023

Download link

CC BY 4.0

Coffee and Cashew Nut Dataset

The datasets presented in this work consist of high-resolution images of coffee and cashew plants acquired using Unmanned Aerial Vehicle (UAV) equipment from small and large-scale farms across Uganda. Images are approximately 10 MB to 12 MB in size and about 4000 by 3200 pixels, at 72 dots per inch (DPI). Each image is annotated with multiple bounding boxes, each enclosing an object of interest. Each image is accompanied by metadata, including the date (timestamp) and the geographic location (latitude and longitude) where it was captured.

Available: November 10, 2023

Download link

CC0 1.0

Makerere Luganda Agricultural Text Data

The dataset consists of sentences in the Luganda language that pertain solely to the agricultural domain. These sentences cover a wide range of topics within agriculture, such as farming, animal breeding, crop cultivation, crop storage and yield, marketing of produce, and environmental aspects. The dataset was created to provide a high-quality, agriculture-specific dataset for the Luganda language that can be used in different use cases, including machine translation for agriculture, language modelling, topic modelling for agriculture, and named entity recognition for agriculture.

Available: May 02, 2023

Download link

CC BY 4.0

Sentiment Tagged Parallel Corpus for Luganda and Swahili

This dataset contains 10,000 parallel sentiment-tagged sentences. English sentences were translated into both Luganda and Swahili. The translations were done by language experts and professional translators in collaboration with researchers at Makerere University. All sentences were tagged with a sentiment code. The sentiment tags were assigned based on the English sentences.

Available: March 24, 2023

Download link

CC BY 4.0

Kiswahili Monolingual Corpus

This dataset contains 100,000 Kiswahili sentences. We want to thank the team at the Makerere AI and Marconi Labs at Makerere University, TAVODET Youth Development (TYD) Innovation Incubator, Ai Kenya, Maseno University, United States International University-Africa (USIU-Africa), and Kabarak University who have worked tirelessly and collaboratively to source, create and prepare this Kiswahili monolingual dataset. This dataset was created with support from Lacuna Fund. For more information on how the dataset was created, please check out our paper published at AfricaNLP.

Available: March 22, 2023

Download link

CC BY 4.0

Acoli Monolingual Corpus

Acoli is a very low-resourced language spoken in parts of Northern Uganda. This dataset contains 40,037 Acoli sentences. The sentences were collected and evaluated by Acoli linguists with the collaboration of teams at Marconi Research and Innovation Lab and Makerere AI Lab from Makerere University. For more information on how the dataset was created, please check out our paper published at AfricaNLP. This dataset was created with support from Lacuna Fund.

Available: March 22, 2023

Download link

CC BY 4.0

Lumasaba Monolingual Corpus

Lumasaba, sometimes known as Lugisu, is a Bantu language spoken in the eastern part of Uganda. This dataset contains a total of 39,999 sentences, split into two separate files: one contains 20,764 sentences from the Northern dialect and the other contains 19,235 sentences from the Southern dialect. This dataset was compiled by a team of linguists and researchers from the Makerere AI and Data Science Research Lab and Marconi Research and Innovation Lab at Makerere University. This dataset was created with support from Lacuna Fund.

Available: March 22, 2023

Download link

CC BY 4.0

Luganda Monolingual Corpus

This dataset contains 100,000 Luganda sentences. Luganda is a Bantu language and is one of the major languages spoken in Uganda. This dataset was compiled by researchers at the Makerere AI and Data Science Research Lab and Marconi Research and Innovation Lab. We want to thank the Department of African Languages, Makerere University and the Ekibiina Ky'Olulimi Oluganda (EKO) for the work done in curating the dataset. We would like to thank the Buganda Kingdom for partnering with us and also for the support towards the collection of this Luganda monolingual text corpus through its agencies. We would also like …

Available: March 22, 2023

Download link

CC0 1.0

Makerere University Cassava Image Dataset

The dataset was created to provide an open-source and well-curated image dataset showing diseased and healthy cassava leaf images from Uganda. This will be used by data scientists, researchers, the wider machine learning community, and experts from other domains to conduct research into automating the identification and diagnosis of cassava crop diseases. The image dataset was collected across three different classes: Healthy, Cassava Brown Streak Disease (CBSD), and Cassava Mosaic Disease (CMD).

Available: August 18, 2022

Download link

CC0 1.0

Makerere University Beans Image Dataset

This beans dataset was created to provide an open, accessible, well-labeled, and sufficiently curated image dataset. It is intended to let researchers run machine learning experiments that support innovations such as bean crop disease diagnosis and spatial analysis. The image dataset was collected across three different classes: Healthy, Angular Leaf Spot (ALS), and Bean Rust.

Available: July 20, 2022

Download link

CC BY 4.0

The Makerere Gendered Corpus: A Gendered English to Luganda Parallel Corpus

This English-Luganda parallel sentence corpus consists of gendered examples created by a team of researchers from Makerere AI Lab at Makerere University together with a team of Luganda teachers, students, and freelancers. The collaborative work, which involved generating English sentences under CC-0 and translating them using a crowdsourced, iterative, and open-source approach, was done using Pontoon, an open-source translation management system built by Mozilla. This is a corpus of 1,000 parallel sentences.

Available: January 17, 2022

Download link