Musau, Wilkister, Christopher Obong’o, Stella Wanjiru, Dickson Otiangala, Mira Emmanuel-Fabula, and Bilal Mateen. 2026. “Co-Designing a Large Language Model Benchmarking Dataset for Primary Care with Nurses in Kenya.” EPIC Proceedings 2025 (1): 265–88. https://doi.org/10.1111/epic.70016.
  • Figure 1. The SBAR framework, adapted to guide nurses in generating real-world medical scenarios and questions.
  • Figure 2. Phases of the Study

Abstract

Large Language Models (LLMs) are increasingly applied in healthcare, yet their training and evaluation often lack grounding in frontline realities in low-resource settings. By grounding content in nurses’ everyday practice, this work contributes a localized benchmark for LLM training and evaluation and offers a replicable model for ethical, inclusive AI design responsive to care realities in resource-constrained environments. It documents the participatory co-design, curation, and descriptive characterization of a nurse-generated dataset for LLM benchmarking in primary healthcare (PHC) in Kenya. Using human-centred design methods, we trained 145 nurses across three counties to generate real-world clinical scenarios and questions using an adapted SBAR (Situation, Background, Assessment, Recommendation) framework. Through workshops, audio recordings, and digital submissions, nurses contributed 7,606 scenarios. These scenarios captured decision-making needs spanning clinical management, referral, communication/counselling, and constraints in diagnostics, equipment, and social context typical of PHC. This article details the co-design process, data pipeline, and dataset descriptives; benchmarking methods and results using this dataset are reported separately.
