Co-designing a Large Language Model Benchmarking Dataset for Primary Care with Nurses in Kenya

Wilkister Musau; Christopher Obong’o; Stella Wanjiru; Dickson Otiangala; Mira Emmanuel-Fabula; Bilal Mateen

doi:10.1111/epic.70016

Introduction

Large Language Models (LLMs) have significantly advanced natural language processing, providing task-agnostic foundations for wide-ranging applications across multiple domains (Ling et al. 2025). These AI systems, increasingly adopted in industries such as healthcare, finance and law, are trained on massive datasets and can generate and translate human language while performing functions tailored to specific needs.

In healthcare, LLMs have been explored for diverse applications, with providers integrating them into different stages of their workflows. Recent studies on applications of LLMs in healthcare report their use in diagnostics, clinical documentation, education and research (Maity and Saikia 2025; Meng et al. 2024; Wang and Zhang 2024). Evidence of LLM adaptation for low-resource and African contexts is beginning to emerge. Alhanai et al. (2024) demonstrated improvements in LLM performance for eight African languages through fine-tuning and cultural adjustments, highlighting the feasibility of deploying these models in multilingual and resource-constrained settings.

However, the effectiveness of LLMs in healthcare depends fundamentally on the quality and representativeness of their training data. LLM training datasets are typically drawn from both public and private sources, including web pages, books, research articles, videos and code repositories (Baack 2024), where many processes behind data construction are opaque despite efforts toward transparency (Hutchinson et al. 2021; Gebru et al. 2021; Rostamzadeh et al. 2022). In high-stakes healthcare settings, accuracy and equity must be safeguarded. LLMs trained on incomplete, biased or low-quality data risk amplifying medical inaccuracies or misrepresenting populations.

The challenge is particularly acute in resource-constrained settings, where models may be trained on data that underrepresents local disease burdens, languages or clinical practices. Research on LLMs in healthcare shows persistent imbalances in representation: studies and datasets are produced in high-income countries and overlook local realities (Restrepo et al. 2024). Evidence also indicates that LLMs can reproduce demographic biases in medical recommendations, favouring higher-income groups while underserving marginalized populations (Omar et al. 2025). Few studies examine LLM adoption in low-resource healthcare settings, particularly in contexts characterized by high patient volumes and limited support for frontline healthcare workers such as nurses.

These findings underscore widespread inequalities in both the creation and application of LLMs in healthcare. DeWitt Prat et al. (2024) advocate for participatory research and ethnography in Africa to “decolonize LLMs,” emphasizing that widely deployed models risk echoing and extending colonialist practices if they do not support African languages and social practices. Because clinical decisions entail patient-safety, equity, and accountability implications, context-aware evaluation is required. At the same time, there is a growing concern from healthcare workers about potential displacement of human care by AI (Akingbola et al. 2024; Rony et al. 2024), reinforcing the need for frontline-centric approaches (Panch, Mattie, and Atun 2019; Templin et al. 2025).

Kenya has expanded its health workforce, yet nurse density remains about 2.23 per 1,000, below the WHO threshold of 4.45 per 1,000 (WHO, 2016, 2020; Nursing Council of Kenya 2024). In understaffed settings, nurses routinely bridge multiple clinical roles, with workload and quality pressures (Rosser et al. 2022; Mbuthia et al. 2023). These realities motivate evaluating how LLM decision support can aid nursing practice.

Responsible integration of LLMs into healthcare requires systematic evaluation of their safety, reliability and fit within clinical workflows, as clinicians’ trust and adoption depend as much on workflow alignment as on technical accuracy (Chang et al. 2024; Obong’o et al. 2025; Denecke, 2024). Benchmark datasets are central to these evaluations because they provide structured ways to compare model performance, yet their design also determines which clinical challenges and patient groups receive attention (Kwiatkowski et al. 2019; Panch et al. 2020; Mincu and Roy 2022). To avoid privileging convenience over clinical relevance, benchmarks should be designed with clear principles: they need to be large and technically rigorous, include human validation and reflect the diversity of populations, settings and practical realities where LLMs will be deployed (Meng et al. 2024; Al Garadi et al. 2025; Myung et al. 2024; Sourlos et al. 2024).

Against this backdrop, the present study employs ethnographic techniques within a Human-Centred Design (HCD) framework to generate primary healthcare (PHC) scenarios and questions as context-specific data. HCD is a problem-solving approach that involves developing a deep understanding of the people you are designing for or with (PATH 2022). Applying HCD principles to benchmark dataset creation seeks to ensure that AI tools genuinely align with users’ needs and values, as emphasized in both academic literature and practitioner guidance (Bai et al. 2024; Bhutani 2025). Using participatory methods, we explore when and how nurses seek guidance and support during routine care. Nurses shared scenarios and formulated questions based on their workflows, offering insight into their real-time decision-making needs.

Community engagement has been largely absent in AI healthcare research, with one scoping review finding less than 0.2% of studies reporting meaningful community involvement (Loftus et al. 2024). This absence undermines trust and relevance in AI systems, as scholars point out that without strong participatory foundations, AI tools risk falling short on transparency and ethical alignment (Delgado and Saenz 2023). Evidence from humanitarian settings shows that involving communities directly in AI tool design ensures technology responds to local realities and urgent needs, a principle that applies equally to healthcare contexts (Berditchevskaia, Malliaraki, and Peach 2020). Research highlights both opportunities and challenges in participatory AI design (Birhane et al. 2022) and proposes new frameworks and solution-oriented methods that move beyond traditional approaches (Hossain and Ahmed 2021; Maas 2024; Parthasarathy et al. 2024). This study demonstrates how these participatory approaches can be implemented in practice, drawing on frontline healthcare providers’ experiences to collect locally applicable data that can inform further research and tool development. The aim of this paper is to describe the co-design of scenarios and questions. Results on model performance are reported in a companion paper.

Methodology

Case Study Design

We used a multi-phase design. An initial ethnographic and human-centred design (HCD) foundation informed our adaptation of the SBAR (Situation, Background, Assessment, Recommendation/Request) clinical communication tool for scenario capture and guided technical integrations to preserve linguistic and cultural cues. Following a pilot workshop, we trained nurses and incorporated cultural probing to support self-documentation of real-world clinical scenarios through text and audio. Data collection was done through structured workshops and independent reflection, allowing nurses to share experiences in their own time and words.

Study Setting and Context

The human-centred design case study was conducted over three months (September–November) in 2024 across Kiambu, Kakamega and Uasin Gishu counties in Kenya, representing the central, western and Rift Valley regions, respectively. This geographic diversity provided an opportunity to gather medical scenarios and questions from nurses across various service contexts within Kenya’s health system.

Kiambu County: A peri-urban area bordering Nairobi County with a well-established network of health facilities.
Kakamega County: A predominantly rural area with high population density where healthcare access varies due to infrastructure challenges.
Uasin Gishu County: A semi-urban area with mixed rural and urban populations, especially in Eldoret town, featuring strong health infrastructure due to the presence of the Moi Teaching and Referral Hospital.

Methodological Innovation

Our approach centres provider voices and embeds ethnographic methods within a human-centred design framework to produce a dataset that is inclusive, representative and contextually grounded. English, an official language in Kenya, was selected as the primary language for data collection, informed by Kenya’s nursing education and clinical training practices, which are primarily conducted in English. Documentation of medical files is also done in English, ensuring consistency in medical terminology and alignment with professional practice. Our adaptation of the SBAR framework represents an ethnographic innovation in AI development. Traditionally used for clinical communication, SBAR was reimagined as a tool for eliciting the nurses’ lived experiences and knowledge needs in a structured, familiar format while preserving their idiolect within the final dataset.

Four steps in the framework with an arrow pointing from 1 to 4:
1. Situation: What was going on with the patient? How did they present?
2. Background: What context or clinical history was related to the situation. This could include relevant details such as previous incidents, patient history, or other background information.
3. Assessment: What did you think the problem was based on history and observations? Is there additional information you needed? What questions did you have? Were you able to make a diagnosis? Is there someone you needed to speak to or consult after the assessment?
4. Request: Any request/guidance needed by nurse related to the problem identified during assessment, what would you like to do? What information do you need to help this patient?

Figure 1. The SBAR framework, adapted to guide nurses in generating real-world medical scenarios and questions.
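For readers who want to handle entries of this kind programmatically, the four steps in Figure 1 map naturally onto a small record type. The sketch below is illustrative only: the class name, field names and completeness check are assumptions and do not describe the Kobo Toolbox form used in the study. The example values are taken from the scenario in Table 4.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class SbarScenario:
    """One nurse-generated scenario in the adapted SBAR form (hypothetical layout)."""

    situation: str
    background: str
    assessment: str
    # The "Request" step: the questions the nurse needs answered.
    questions: List[str] = field(default_factory=list)

    def is_complete(self) -> bool:
        """Return True when every narrative step is non-empty and a question exists."""
        narrative = (self.situation, self.background, self.assessment)
        return all(part.strip() for part in narrative) and len(self.questions) > 0


# Example populated from Table 4 (Scenario and Questions Example 3).
example = SbarScenario(
    situation=(
        "A boy aged 11 years is brought to the facility with a history of "
        "having fallen down from a tree and sustained multiple fractures."
    ),
    background="No medical or social history of significant importance.",
    assessment=(
        "He was in severe pain, with a temp of 38.0 and a pulse of 80 bpm. "
        "Compound fractures of both upper limbs."
    ),
    questions=[
        "Since I have no splints, how do I immobilize the limbs before referral "
        "and what medication can I give prior to referral?"
    ],
)

print(example.is_complete())
```

Keeping the questions in a separate list, rather than inside the narrative text, mirrors the distinction between a scenario and a question that the workshops emphasised.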

Dataset Scope and Target Sample Size

The objective was to collect over 5,000 scenarios and questions from three healthcare settings to contribute to a benchmark dataset. This target was set to ensure the dataset covers the diversity of primary healthcare and strengthens the reliability of LLM evaluation results while maintaining feasibility for curation and high-quality annotation. The aim was to capture a wide range of patient scenarios in the selected geographies, thus enhancing both representativeness and localization. Questions were framed as clinical requests, clarifications, or additional information that nurses would typically seek to proceed with patient care. Each nurse individually formulated questions after outlining all relevant patient details using the SBAR framework, so that the questions reflect the types of inquiries that arise in real-time decision-making. Examples of nurse-generated scenarios presented verbatim in SBAR format are shown in Tables 1, 3, 4 and 5.

Table 1. Scenario and Question Example 1

Situation	The patient arrived complaining of severe chest pain, wheezing and difficulty in breathing.
Background	They presented symptoms for two days, but they only became severe two hours ago; the difficulty in breathing was settled.
Assessment	On examination, the patient’s BP was stable at 100/60, with a pulse of 60. The SpO2 (Oxygen saturation), was 64%.
Request (Questions)	What will I need to manage the patient? Can this be an asthmatic attack? What are the drugs I’m supposed to give? Do I start with nebulization or oxygen therapy, given the oxygen level was low? Then, I proceed only for further evaluation of the patient. We need a chest X-ray to rule out pneumonia or anything else.

Pilot Workshop

We conducted an in-person pilot workshop with eight nurses to test our messaging in preparation for training and data collection to ensure the nurses had the proper guidance to generate scenarios and questions. The nurses tended to list the questions they ask patients during visits (e.g., record-keeping) rather than questions they themselves need answered to make clinical decisions. This observation indicated that our messaging was unclear. We therefore reframed the task explicitly around clinician information needs, introducing the prompt: “What questions do you have before making a decision for this patient?” and emphasising that questions should follow the patient history rather than duplicate it.

To further guide nurses in scenario and question generation, we conducted a journey mapping exercise. Nurses outlined a typical day of service provision, noting the key moments in their workflow, their goals at each stage and the pain points they encountered. Within each key step, we then asked the nurses to share an example of a patient visit. For instance, during triage, they described a typical case from when a patient arrived through to treatment or referral, including what was going on, how the patient appeared and what followed. This process kept examples anchored in their own workflows and care settings (see Table 2). During discussions, nurses identified their motivators and pain points, which served as prompts to encourage sharing of real cases where they had experienced those challenges or achievements. For example, when nurses mentioned successful management of a patient’s condition as a motivator, we asked them to provide typical examples of times when they felt they had successfully managed a patient.

Since examples were initially shared in an unstructured format, the research team worked with the nurses to break them down into SBAR-based scenarios and explicit questions. Collaborative editing helped nurses understand, using their own experiences, what constituted a scenario versus what formed a question. The outputs informed both the tools and the facilitation used in the main workshops.

Table 2. Journey Map of a Nurse’s Workflow (PHC)

Key Step	Goals	Gains/Motivators	Pains/Challenges
Report to nursing station	Arrive on time to begin serving clients	Recognition from supervisor; smooth start to the day	Workload pressure in event of shortage of staff
Waiting bay	Provide health talks to clients who arrive early (before 8:00 am services)	Motivated and educated clients	Limited time for talks when workload is high
Triage	Screen clients, check patient card for history, confirm main complaint	Ability to identify urgent cases quickly	Burnout from high patient volume
Consultation	Learn more about patient’s complaint Consultation depends on facility level: • Nurse (Level 2 & 3) • Clinical Officer (Level 3 & 4) • Medical Officer (Level 4)	Managing a patient’s condition well; patient satisfaction	Pressure from supervisor, distress from wishing you could help a patient in certain ways to show compassion (in case of hunger, homelessness, etc.) but are unable to
Investigations	Send patient for laboratory tests or imaging if required	Access to full diagnostic services; accurate diagnosis and reduced need for referral	Unavailability of services (e.g., x-rays, some lab tests); frustration when unable to confirm diagnosis
Treatment/Referral	Provide care as needed or refer patient to a facility with required services	Successful referral; positive feedback from clients; seeing patient condition improve	Unavailability of services (e.g., x-rays, some lab tests); frustration when unable to confirm diagnosis

We tested both writing and audio recording for documentation. Nurses generally preferred audio, but most found it helpful to take notes before recording their scenarios and questions. This notetaking was possible because nurses took turns, which gave peers time to reflect and draft ahead of their own turn. In addition, our initial hypothesis that nurses could generate 100 questions in five hours (a one-day workshop) was revised to 30–40 scenarios, as nurses found the original target challenging and quality diminished over time as they tried to reach it.

The pilot workshop was designed to create an open and reflective space where participants could share their concerns and reflect on their professional knowledge needs. This approach encouraged nurses to look inward and articulate areas where they experienced challenges and required support. Rather than focusing solely on patient queries, they were able to voice areas of uncertainty and decision-making in their workflows.

Recruitment and Training

After refining the training language and tools, we recruited 145 nurses from the three counties via county health management teams, supplemented by peer referral and hospital contacts. We sampled across facility levels, competencies, and years of experience to ensure diversity.

The recruited nurses attended half-day training sessions in each county incorporating lessons from the pilot. We aligned on project goals with refined language, practised using the SBAR tool to structure scenarios and questions and conducted group feedback sessions where nurses could practise asking questions based on their needs before moving on to solving the challenge.

To ensure the nurses’ thought processes were grounded in their lived experiences and workflows, scenario and question examples were provided only as a last resort. Once nurses understood this foundation, they were given the SBAR framework to guide them in structuring their scenarios.

Following the learning and practice sessions, nurses participated in a dry run simulating the actual data collection workshop. The nurses were placed in groups of 4–6 and each nurse took turns verbally sharing their scenarios and questions with audio recording done on Kobo Toolbox (an open-source digital data collection platform), with the help of a rapporteur. During the dry run, each nurse shared two scenarios to get a sense of how one full cycle of scenario sharing would transition to the other. The dry run established timing and roles; the same groups participated in the actual data collection the following day.

Data Collection Workshops

In each county, data collection started immediately after the training. Nurses were grouped into teams of 4–6, mirroring their training groups. Within each group, nurses took turns recording their scenarios using audio while others prepared their narratives and questions in written notes. This setup allowed rapid successive recording of multiple scenarios and allowed uncertain participants to draw inspiration from peers.

Scenario documentation followed the SBAR framework, which was pre-loaded into Kobo Toolbox. Kobo Toolbox enables the design of structured surveys, offline data capture and secure storage and export of responses. Each group had a rapporteur who captured the audio using a simple record-and-stop function within the Kobo app. As nurses became more familiar with the process, some chose to document independently using a direct Kobo link sent to their devices, which allowed them to step away from the group to record their scenarios privately. The self-documentation process laid the foundation for continued data collection through online contributions after the in-person workshops.

To stimulate scenario generation, we provided a non-exhaustive list of common disease areas encountered in nursing derived from the Kenya MOH 711A form (a national reporting tool that captures routine facility-level data on common diseases and service utilization). This list served as a prompt when participants felt stuck rather than prescriptive guidance and nurses were encouraged to prioritize their own clinical experiences.

Each participant was encouraged to generate 30–40 questions during the workshop based on the insights and experience from the pilot testing and training workshop. While the average contribution was around thirty questions per nurse, a few reached the upper target. Rapporteurs emphasized quality over quantity, encouraging nurses to prioritize authenticity and relevance. Kobo Toolbox enabled seamless data capture both online and offline, supporting flexibility across all settings. In one county, the participant contributed ten scenarios in written format during the workshop, with additional written scenarios submitted over the following week through the rapporteur, who compiled them.

The SBAR framework guided participants through each step, starting with first outlining the situation, highlighting how the patient presents during a visit and allowing the nurse to capture information from observation and typical details such as patient demographics. Next, the patient’s background information or chief complaints were collected, including sharing any context related to the symptoms or relevant information needed to understand and evaluate the patient’s situation. Nurses then described their clinical information-gathering process, including examinations or tests conducted, before formulating questions where they would typically seek guidance on diagnosis or treatment decisions.

Table 3. Scenario and Questions Example 2

Situation	A 30-year-old woman presents in the antenatal clinic.
Background	She wants to start her first antenatal visit on examination.
Assessment	She is well oriented, clean and alert. On observation, her blood pressure is 120/70, pulse is 78 and respiration is 18. The mother looks a bit dehydrated, but upon further inquiry, she tells me she has not had anything to drink since the morning and she has not had enough sleep. I tell the mother that she should go for an antenatal profile first before I attend to her and then we can do the physical examination. Once she is back with the antenatal result, her HIV test results turn positive. She asks me what that means. She does not want to start on the antiretrovirals. What should I do? She has refused to commence the anti-HIV regimen. She has vehemently refused.
Request (Questions)	My question is, should I proceed with the other care? Should I give her time to think about it, or should I not give her any care at all and give her more time? I need the guideline as to when a patient comes and they refuse the medication, what should I do next?

Table 4. Scenario and Questions Example 3

Situation	A boy aged 11 years is brought to the facility with a history of having fallen down from a tree and sustained multiple fractures.
Background	No medical or social history of significant importance.
Assessment	He was in severe pain, with a temp of 38.0 and a pulse of 80 bpm. Compound fractures of both upper limbs. Diagnosis is a compound fracture.
Request (Questions)	Since I have no splints, how do I immobilize the limbs before referral and what medication can I give prior to referral?

Table 5. Scenario and Questions Example 4

Situation	A 58-year-old male patient came to the health facility with a history of two weeks of swelling of his legs, fatigue and increased shortness of breath.
Background	He also noticed reduced urine output and appears darker than usual. The patient has a history of hypertension and is on medication.
Assessment	On assessment, BP-170/95 mmHg, P 88 b/m, T-36.7, R-20 b/m
Request (Questions)	What laboratory and radiological investigations can be done? What is the immediate and subsequent management? What specialized referral can be done?

Virtual Data Collection

After each in-person workshop, nurses were invited to continue contributing scenarios and questions online. Using a preloaded SBAR framework on Kobo Toolbox, they could self-document and submit entries either as written text or audio recordings.

This phase lasted seven days, with nurses encouraged to submit as many questions as possible, up to a maximum of 100. Based on earlier feedback, this setup offered greater flexibility, enabling more thoughtful contributions. Nurses were added to county-specific WhatsApp groups where updates, reminders and troubleshooting support were provided throughout the week.

While not all nurses were able to participate in the online follow-up due to competing responsibilities, those who did generally contributed more questions than they had during the in-person sessions. The hybrid approach, combining structured workshops with asynchronous digital submissions, proved responsive to participants’ contexts while maintaining methodological rigor. It also highlighted the potential of digital platforms for supporting ethnographic data collection. Figure 2 summarizes the data collection.

Phases of the study highlighting preparation activities leading up to data collection: 1. Pilot Workshop: testing of tools and processes 2. Recruitment: nurses 3. Training: Introduction to project, tools, and a dry run of data collection 4. Data Collection: In-person workshops and self-documentation to generate scenarios and questions

Figure 2. Phases of the Study

Question Analysis

A random sample of 200 scenarios and their associated questions was reviewed to identify recurring patterns. Similar questions were clustered, and the emerging categories were refined. The resulting patterns were consolidated into thematic categories.
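The sampling step can be illustrated with a short sketch. The paper does not describe how the random sample was drawn, so the function name, the fixed seed and the use of sequential scenario identifiers below are all assumptions made for illustration. Only the sample size (200) and the total number of scenarios (7,606, reported in the Findings) come from the study.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def draw_review_sample(items: Sequence[T], k: int = 200, seed: int = 0) -> List[T]:
    """Draw k items without replacement; the seed makes the draw repeatable.

    Hypothetical helper: the seed value and this procedure are assumptions,
    not the study's documented method.
    """
    rng = random.Random(seed)
    return rng.sample(items, k)


# Assumed identifiers 1..7606, one per collected scenario.
scenario_ids = list(range(1, 7607))
sample_ids = draw_review_sample(scenario_ids)

print(len(sample_ids))
```

Fixing the seed is one way to let a second reviewer reproduce the same sample before the clustering and refinement of categories begins.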

Ethical Considerations

This study received ethical approval from Maseno University IRB number: MSU/DRPI/MUERC/00899/20. All participants were informed about the purpose of the study, their right to withdraw at any time and how their data would be used. Written informed consent was obtained prior to participation. To protect confidentiality, all personal identifiers were removed from the data and scenarios were reviewed to ensure anonymity. Audio recordings were securely stored and transcribed using TurboScribe, with access limited to the research team. Participation was voluntary and care was taken to ensure that no harm came to participants during or after the study. The use of AI tools in data processing was conducted in accordance with ethical guidelines to maintain data integrity and privacy.

Findings

A total of 7,606 scenarios with questions were collected from 145 nurses (see county breakdown in Table 6).

Table 6. County Breakdown of Scenario Format Submission

N	County	Audio	Audio %	Text	Text %	Total
1	Kiambu	1,314	100	0	0	1,314
2	Kakamega	2,131	67.18	1,041	32.82	3,172
3	Uasin Gishu	0	0	3,120	100	3,120
Total		3,445	45.29	4,161	54.71	7,606
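The percentage columns in Table 6 follow directly from the audio and text counts. The short Python sketch below recomputes them as a check. The variable and function names are illustrative and are not part of the study's tooling.

```python
# Audio and text submission counts per county, as reported in Table 6.
counts = {
    "Kiambu": (1314, 0),
    "Kakamega": (2131, 1041),
    "Uasin Gishu": (0, 3120),
}


def shares(audio: int, text: int):
    """Return (audio %, text %, total), with percentages rounded to two decimals."""
    total = audio + text
    return round(100 * audio / total, 2), round(100 * text / total, 2), total


for county, (audio, text) in counts.items():
    print(county, shares(audio, text))

# Overall split across the three workshops.
total_audio = sum(audio for audio, _ in counts.values())
total_text = sum(text for _, text in counts.values())
print("Total", shares(total_audio, total_text))
```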

The study involved 145 nurses across the three counties (see Table 7), primarily from Kakamega (42.76%) and Uasin Gishu (40.69%), with fewer from Kiambu (14.48%). Owing to their close proximity to these main counties, a few nurses came from Bungoma (0.69%) and Elgeyo Marakwet (1.38%) as part of the Uasin Gishu workshop. Of the total, 117 (80.69%) were women and 28 (19.31%) were men, reflecting the global and national gender trends in nursing (WHO, 2020; Kenya Health Workforce Report, 2016).

In terms of professional roles, the majority were PHC nurses (76.55%), while only 13.79% served as senior nurses, suggesting a workforce focused on direct patient care (see Table 7). Experience among the nurses ranged from one to thirty-eight years, with an average of 13.56 years, providing a diverse mix of expertise. Regarding educational qualifications, 63.45% held diplomas, highlighting the emphasis on practical training. Of the 145 nurses, 107 (74%) worked in mid-tier healthcare facilities, such as health centres and sub-county hospitals (see Table 8).

Table 7. Nurse Demographics

Category	Subcategory	N
County	Kakamega	62
	Kiambu	21
	Uasin Gishu	59
	Bungoma	1
	Elgeyo Marakwet	2
Gender	Female	117
	Male	28
Role	Nurse	111
	Senior Nurse	20
	Blank	14
Facility Level	Level 1 (Community)	0
	Level 2 (Dispensary)	21
	Level 3 (Health Centre)	33
	Level 4 (Sub-County Hospital)	32
	Level 5 (County Referral Hospital)	2
	Level 6 (National Referral Hospital)	9
Years of Experience	1–5 years	15
	6–10 years	29
	11–15 years	38
	16–20 years	26
	21–30 years	31
	31–38 years	4
	Missing	2
Certification Level	BSN (Graduate)	32
	Diploma	92
	Other	14
	Missing	7
Area of Specialization	General Nursing	118
	General Nursing & Primary Care	2
	Primary Care	6
	Cardiology	4
	General Nursing & Cardiology	2
	Mental Health & Psychiatric Nursing	1
	Other	9
	Missing	3

Table 8. Nurses per Facility Level

Facility Level/County	Kiambu	Kakamega	Uasin Gishu	Total (grouped)
Level 1 (Community)	0	0	0	Levels 1–4: 107 (74%)
Level 2 (Dispensary)	5	13	8	
Level 3 (Health Centre)	7	25	8	
Level 4 (Sub County)	9	20	12	
Level 5 (County)	0	7	3	Levels 5–6: 38 (26%)
Level 6 (National/Referral)	0	0	28	
Total	21	65	59	145
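The grouped totals in Table 8 (107 and 38) are sums over facility levels rather than per-row figures. The sketch below rebuilds them from the per-county counts in the table. The dictionary layout and variable names are assumptions made for illustration.

```python
# Nurses per facility level as (Kiambu, Kakamega, Uasin Gishu), from Table 8.
facility_counts = {
    "Level 1 (Community)": (0, 0, 0),
    "Level 2 (Dispensary)": (5, 13, 8),
    "Level 3 (Health Centre)": (7, 25, 8),
    "Level 4 (Sub County)": (9, 20, 12),
    "Level 5 (County)": (0, 7, 3),
    "Level 6 (National/Referral)": (0, 0, 28),
}

# Per-level totals across the three workshop counties.
row_totals = {level: sum(per_county) for level, per_county in facility_counts.items()}

levels = list(facility_counts)
lower_tier = sum(row_totals[level] for level in levels[:4])  # Levels 1-4
upper_tier = sum(row_totals[level] for level in levels[4:])  # Levels 5-6
all_nurses = lower_tier + upper_tier

print(row_totals)
print(lower_tier, round(100 * lower_tier / all_nurses))
print(upper_tier, round(100 * upper_tier / all_nurses))
```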

From the sampled scenarios, the range of clinical questions raised was broad, with recurring themes in diagnostic uncertainty, patient history interpretation and treatment options in resource-constrained settings. While these clinical categories were prominent, a substantial number of questions addressed non-clinical concerns, including logistical barriers, social and emotional factors, administrative processes and ethical considerations (see Table 9). This thematic diversity highlights the multifaceted nature of decision-making in real-world contexts. The scenarios and corresponding questions shared by participants point to a strong demand for decision-support resources that reflect the complexity of real-world clinical environments. Rather than textbook-style cases, the data suggest a preference for tools that accommodate uncertainty, incomplete information and variability in local contexts.

Thematic Analysis of Questions Asked within the Scenarios

The table below outlines examples of high-level themes highlighted within the questions. Some questions may reflect multiple themes. The themes presented in this paper are derived from a sample of the scenarios.

Table 9. Themes Highlighted within Questions

Theme	What it entails	Question examples
General Patient Management	Day-to-day care, procedures, decision-making, nursing actions	How do I monitor this patient? What should I do? How should I manage this patient?
Diagnosis/ Condition Identification	Identifying what’s wrong with patient, interpreting symptoms and differential diagnosis	What could be causing low BP? Can this be an asthmatic attack? Is this mother having an incomplete abortion?
Investigations and Diagnostics	Questions on which tests to run, how to interpret the results and confirming conditions	Should I do the urinalysis first before administering the drugs? Does this patient require an x-ray? What investigations should I do?
Medication and Treatment	Questions about drugs, dosage, alternatives, fluids and wound care	Can I give IV fluids? What type of analgesic is appropriate for him now that we don’t know if he has internal injuries? Should I give BCG?
Clinical Guidelines and Protocols	Use of SOPs, national guidelines, or protocol sequencing	Should I reduce the BP first or blood sugars? What infection control protocol should be prioritized? What are we going to do about dog bite wounds that have bleeding?
Tools, Equipment and Technology	Use and availability of machines, tools, or equipment	What formula do I use to determine the percentage of the burn? Which method do I use to remove the bead before sending it to an ENT specialist? And now that we didn't have a machine for RBS, will this lady benefit from a referral?
Referral and System Navigation	Referrals, transfers and system-level responsibilities	Do they need a referral to a bigger facility? Will we help him at our level or just refer him? Which percentage of burns can I treat at my facility?
Communication and Patient Interaction	Communication with patients, families, or authorities	How do I explain the sudden death to the relatives? Which authority do I inform concerning the management of this patient? Should I report to the children’s department office?
Education and Knowledge Sharing	Health education for patients and training for other providers	Do I need to educate this patient on hygiene? How do I counsel this patient to avoid buying over-the-counter drugs? Which exercises should I teach the patient for quick recovery?
Resource and Contextual Challenges	Questions arising from lack of resources, limited equipment, or affordability issues	For the premature baby, the NBU is not available. What can we do? If Salbutamol is not available in the facility, what else can I use? If it’s not available at that time, what can I do?
Social and Cultural Issues	Questions where nurses grapple with social realities and interpersonal dynamics that influence patient care	How was I going to tackle the case bearing in mind that the husband doesn't want his status to be disclosed to the wife? What if the mother fails to go to hospital on delivery and has a home delivery? Should I disclose the information to the client before further intervention, as this was a precious baby to the mother?

Insights from Scenario and Questions Sharing

About half of the nurses in each workshop arrived with pre-written scenarios in their notebooks. This early preparation followed guidance from the earlier dry-run session, where participants were encouraged to reflect on clinical situations in advance. This proactive engagement contributed to a smooth start and early momentum in scenario sharing.

The use of audio recordings gave nurses the freedom to share their experiences without pressure to summarize or follow rigid templates. Scenario narratives varied in length and depth. Some were shared with detailed context, while others remained brief and to the point. In many cases, nurses diverged from the expected SBAR format, sometimes introducing clinical questions early in the scenario or integrating them into broader reflections. The audio recordings captured these nuances, including hesitations, emotions and contextual reasoning, offering richer insight into the nurses’ clinical thinking.

All nurses in each workshop reported that the activity was challenging, noting that it was one of the first times they had been asked to externalize and articulate their internal clinical thought processes. Many shared that the experience helped them recognize the types of questions and judgments they make instinctively or subconsciously while delivering care.

More than half of the nurses also found it difficult to generate their initial individual target scenarios. Because they typically resolved issues quickly and independently in their daily work, they struggled to recall or share questions that could inform scenarios. This insight led us to adjust our language: we encouraged nurses to share questions even in situations where they had ultimately answered them on their own. Some nurses also suggested generating questions online to allow more time for inspiration.

Insights on Workshop Facilitation

The team adopted simplified language when introducing the project, emphasizing “clinical decision support” rather than complex terms like AI or LLM. Framing the workshop as an effort to build knowledge systems that could potentially help nurses in African contexts helped generate buy-in.
The prompt “What questions do you have before making a decision for your patient?” effectively anchored the concept of question generation among participants.
Trained facilitators were advised not to use the SBAR abbreviation when introducing the tool, but instead to present the full terms: situation, background, assessment and request. This move aimed to minimize the risk of misinterpretation based on participants’ pre-existing understanding of the format and use of the tool.
When introducing the scenario outline, facilitators asked participants to draft scenarios and related questions before showing examples. This approach surfaced early gaps and allowed for real-time clarification.
Most nurses quickly grasped the process of question generation during the in-person workshops. However, output slowed considerably after about ten scenarios. By the twentieth scenario, many were visibly fatigued or stalled. Facilitators encouraged participants to continue submitting even very simple scenarios to maintain momentum, while stressing quality over quantity.
Group work proved helpful. In groups of two to three, nurses who understood the process helped others adapt. This peer-learning effect reinforced the potential value of small group facilitation for future iterations.
Participants were reminded to submit a mix of simple and complex scenarios. Some initially assumed that only difficult clinical situations were being sought.
Fatigue emerged as a critical challenge. Facilitators encouraged teams to take breaks as needed, recognizing that the repetitive nature of the task affected engagement levels over time.

Documentation Process

Some rapporteurs were unable to keep up with the speed of verbal input; others worked from audio recordings rather than live typing.
Some rapporteurs allowed nurses to document several scenarios before verbally sharing them for entry into Kobo. This writing phase appeared to support deeper reflection and improved quality in subsequent scenarios.
There was a clear need for rapporteurs to be well-versed in the process, including expected inputs and how to guide participants.
Some participants posed clinical questions they would ask patients (e.g., “What is their LMP?”), which required facilitators to clarify the purpose of the task.
Language use during data collection presented challenges, as some participants interspersed Swahili into their scenarios and diverse accents occasionally affected clarity in the recordings. To support transcription accuracy and consistency, rapporteurs were reminded to encourage the use of English throughout the sessions.

General Feedback from Nurses

Initial engagements with nurses revealed a strong culture of self-reliance in clinical decision-making. As one participant noted, “We rarely have unanswered questions; we always find solutions.” This norm prompted facilitators to clarify that the project was also interested in understanding the clinical questions nurses grapple with even when they eventually resolve them independently.

Additionally, some participants questioned the involvement of clinicians in the process, asking questions like, “Why are answers to our questions being provided by clinicians?” The team clarified that while clinicians play a role in the answer phase, the design process welcomes both nurse- and clinician-generated input in order to reinforce the relevance of nurses’ voices in clinical decision-making.

Discussion

We co-designed and curated a nurse-generated PHC dataset from three Kenyan counties using a human-centred design approach. By eliciting scenarios with an adapted SBAR structure, the dataset captures decision points that nurses face in routine care, including clinical management, diagnostics, referral, communication and resource constraints. This study addresses critical gaps in contextual awareness and the representation of frontline realities in the potential use of LLMs in healthcare (Hamid and Brohi 2024; Talukdar and Biswas 2024). Our participatory HCD approach, specifically involving frontline nurses, directly tackles the pervasive issue of LLM training and evaluation lacking grounding in actual clinical workflows and resource-constrained PHC settings. The scenarios and questions collected contribute a unique benchmark dataset for testing and evaluating LLMs for PHC.

The participatory approaches used in this study anchored the dataset in the lived experiences of nurses. Beyond ensuring contextual relevance, these methods fostered peer learning and mutual support throughout the process. Audio recording allowed participants to create rich, unfiltered scenarios that preserved the natural flow of nurses’ thought and decision-making processes in real time, though the recordings required significant effort to clean and organize. Notably, the recordings show that clinical decision-making often begins earlier than expected, sometimes during the background information stage, before any assessment is performed. This observation points to a potential need for decision-support tools that engage nurses at earlier stages of the care process. This participant-driven approach highlights the importance of flexible methodologies when designing clinical decision support tools intended to operate in dynamic and resource-constrained settings (Anibal et al. 2025).

The contribution is significant given the central role of benchmarks: they address limitations that LLMs often suffer from, such as outdated knowledge, lack of contextual awareness and bias in recommendations (Denton et al. 2020). These gaps are amplified in low-resource settings, where local disease burdens and infrastructural constraints are unaccounted for in LLM training data. The dataset generated in this project responds with a clinically grounded benchmark that captures the realities of PHC in Kenya and enables evaluation across both technical accuracy and clinically meaningful dimensions such as information quality, reasoning, clarity of expression and potential safety implications (Tam et al. 2024).

Strengths

Involving nurses in the early stages of dataset creation provided an unexpected benefit: it created space for nurses to reflect on their own clinical reasoning. We mapped nurses’ journeys and workflows to adapt the SBAR framework, which then guided nurses in capturing key elements of clinical scenarios and follow-up questions. The mapping process brought out pain points in their workflows and informed how they would ask questions based on what information they needed to provide care. Some participants noted the challenge of articulating their decision-making processes, underscoring a gap in how this thinking is externalized in routine care. This finding has implications for tool design, professional development and reflective practice.

Our focus on nurses as primary dataset contributors is also significant, as their voices are often excluded from AI development. Most clinical decision-support tools are developed using physician-centred data and workflows, as evidenced by scoping reviews showing a scarcity of tools designed for non-physician users (Harada et al. 2021), field studies exposing mismatches with local nurse practices (Wang et al. 2021) and reviews noting a lack of nursing-specific AI tools (Wei et al. 2025). Nurses make critical real-time decisions, often in high-pressure, low-resource settings. By showing how nurses’ reasoning can be captured and organized, this study helps bridge the gap in tool development.

This study also highlights the need to incorporate community perspectives beyond healthcare providers. Tools supporting care delivery across health system levels must account for community values and patient expectations, especially where AI trust is nascent or non-existent (d’Elia et al. 2022; Dankwa-Mullan 2024).

Limitations

This study has six key limitations. First, data collection occurred in only three Kenyan counties, and findings may not be generalizable across Kenya or other health systems. Expanding geographic coverage in future studies would strengthen contextual relevance and tool applicability. Second, thematic analysis was conducted on a sample of the data and focused on identifying broad recurring themes in the questions rather than subthemes. As a result, some finer distinctions in knowledge requirements may not have been highlighted for the whole dataset. Assigning themes to all the questions in future analyses could provide a more detailed map of the specific areas on which frontline healthcare providers shared questions. Third, the dataset contains only text-based data and does not incorporate multimodal data such as images or clinical charts. In real-world PHC settings, decision-making often involves interpreting visual cues (e.g., skin and eye appearance) and reviewing patient records. The absence of these modalities may constrain the dataset’s ability to fully represent clinical complexity and limit the scope of LLM evaluation to textual reasoning. Future iterations of this dataset could benefit from integrating diverse data types to better reflect these realities. Fourth, the use of English as the primary language for data collection may have constrained the way scenarios and questions were expressed. While participants were encouraged to use English, it is possible that some nuances typically conveyed in Kiswahili or local dialects were not fully captured. Fifth, data collection modalities varied across counties, with one county using text-only documentation and another audio-only. This variation reflected logistical considerations, including facilitator capacity and available infrastructure. Sixth, participant compensation may have influenced participation, which should be considered when interpreting the findings.

Conclusion

This study demonstrates how participatory methods can create contextually grounded benchmark datasets. By grounding the data in the experience of nurses in the Kenyan context, the results of this case study highlight the diverse nuances of what providing care in PHC entails, with scenarios extending beyond clinical questions. Drawing on nurses’ workflows and lived experiences, the process ensured that the resulting data reflected real-world complexity and contextual relevance, elements often overlooked in “general-purpose” datasets. This dataset may be used to benchmark LLMs for primary health care, characterize model failure modes and inform safer evaluation protocols in low-resource settings, thereby strengthening the currently limited evidence base on LLM evaluation in such contexts (Rutunda et al. 2025).

Future studies should expand geographical reach and explore additional methods such as asynchronous or digital contributions to complement in-person workshops and reduce fatigue, especially when building large datasets. Overall, this work highlights the essential role of HCD in shaping AI tools that are ethical, useful and responsive to the realities of care in diverse settings.

Acknowledgements

We recognize with deep appreciation the late Brian Taliesin, whose insight, generosity of thought and intellectual leadership were instrumental to both the study and the development of this paper. We remain immensely grateful for the commitment and dedication he brought to the project.

We gratefully acknowledge the support of Moi University (Kenya) for their contribution to data collection in this study.

Family planning	Maternity/Safe Deliveries (Maternal complications)	TB	Maternity- PMCT
MCH – ANC/PMCT	STI	Child Health and Nutrition Information System (CHANIS)	PAC Services
VCT*	Blood Safety	ART	DTC
MOH 711A Form – Disease burden reporting indicators

Disease	Description
Malaria	One of the leading causes of morbidity and mortality, particularly in children under five. Nurses frequently diagnose and treat malaria, provide preventive measures and educate communities on the use of insecticide-treated nets.
Respiratory Infections	Upper respiratory tract infections, including pneumonia, are prevalent, especially in young children. Nurses manage these conditions through diagnosis, treatment and health education.
Gastrointestinal Diseases	Diarrheal diseases, often caused by contaminated water and poor sanitation, are common. Nurses provide rehydration therapy, treat infections and educate on hygiene practices.
HIV/AIDS	With a significant number of people living with HIV, nurses are heavily involved in testing, counselling and managing antiretroviral therapy (ART). They also educate on prevention strategies.
Tuberculosis (TB)	TB remains a major public health issue. Nurses are involved in screening, diagnosing and managing TB cases, including the provision of DOTS (Directly Observed Treatment, Short course).
Maternal and Child Health Issues	Nurses deal with antenatal care, childbirth, postnatal care and childhood immunizations. They also manage complications like preeclampsia and malnutrition.
Non-Communicable Diseases (NCDs)	There is an increasing burden of NCDs like hypertension, diabetes and cardiovascular diseases. Nurses provide routine monitoring, education and management for these conditions.
Skin Infections and Conditions	Skin conditions such as scabies, fungal infections and dermatitis are commonly seen, especially in rural areas.
Sexually Transmitted Infections (STIs)	Apart from HIV, other STIs like syphilis, gonorrhoea and chlamydia are frequently managed by nurses.
Malnutrition	Particularly in children, malnutrition remains a significant health challenge. Nurses provide nutritional support, education and manage associated complications.
Typhoid Fever	Due to poor sanitation and contaminated water, typhoid fever is prevalent, particularly in urban slums and rural areas. Nurses are involved in diagnosing, treating and educating communities about prevention.
Helminthiasis (Worm Infections)	Parasitic worm infections, including schistosomiasis and soil-transmitted helminths, are common in areas with poor sanitation. Nurses manage these infections through treatment and preventive education.
Measles	Despite vaccination efforts, outbreaks still occur, especially in areas with low immunization coverage. Nurses play a crucial role in vaccination campaigns, treatment and managing complications.
Eye Infections	Conditions like conjunctivitis and trachoma are common, particularly in rural and semi-arid regions. Nurses provide treatment and educate on hygiene practices to prevent spread.
Malnutrition-Related Conditions	Beyond general malnutrition, conditions like kwashiorkor and marasmus are seen in children. Nurses manage these cases through nutritional rehabilitation and education.
Mental Health Issues	Mental health conditions, such as depression, anxiety and substance abuse, are increasingly recognized in primary healthcare. Nurses provide initial assessment, basic counselling and referrals to specialized care.
Dengue Fever	In certain regions, especially during rainy seasons, dengue fever outbreaks occur. Nurses are involved in early detection, treatment and public health education on mosquito control.
Trauma and Injuries	Due to road accidents, domestic violence and occupational hazards, trauma cases are common. Nurses provide first aid, wound care and referrals for further treatment.
Asthma and Other Chronic Respiratory Conditions	With increasing urbanization and pollution, respiratory conditions like asthma are more frequently encountered. Nurses manage these through medication, education on avoiding triggers and monitoring.
Brucellosis	Particularly in pastoralist communities where people are in close contact with livestock, brucellosis is a concern. Nurses diagnose and treat this zoonotic disease and provide education on prevention.
Common diseases in Kenya encountered by nurses in PHC
