
P-ISSN 1559-890X
E-ISSN 1559-8918
Case Studies
Vol. 2025, Issue 1, 2025 • January 19, 2026 PDT

From Chaos to Innovation: Understanding Products and People in a Non-Deterministic World

Katie Johnson, Larry Becker, Lindsey DeWitt Prat, Megha Goel, Gavin Lew
Keywords: AI product development, ethnography, healthcare, human-AI collaboration, human-computer interaction, product strategy, prototypes, topic segmentation, usability, UX
CC BY-NC-ND 4.0 • https://doi.org/10.1111/epic.70006
EPIC Proceedings
Johnson, Katie, Larry Becker, Lindsey DeWitt Prat, Megha Goel, and Gavin Lew. 2026. “From Chaos to Innovation: Understanding Products and People in a Non-Deterministic World.” EPIC Proceedings 2025 (1): 206–23. https://doi.org/10.1111/epic.70006.

Abstract

In an era tempted by rapid in-market iteration, this paper demonstrates the critical role of ethnographic methods for understanding complex human interactions with non-deterministic LLMs. Through a longitudinal Wizard-of-Oz study of “Nova,” a simulated AI family wellness assistant, we exposed limitations of traditional usability methods in high-stakes, multi-participant contexts. Our methodological approach documented organizational chaos in group AI interactions, identified optimal patterns in human-AI and Human-in-the-Loop (HITL) collaborations, and traced the nuanced process of AI relationship formation and its impacts on user reflection and behavior. Ethnographic insights led directly to innovations including dynamic topic segmentation technology and multiple patent applications. This work demonstrates the indispensability of ethnographic methods for understanding AI systems within authentic social contexts, where human expertise and support remain vital.

Watch the video presentation here.

Introduction

In the current era of software and AI development, it can be alluring to suspend ethnographic research, which may be deemed too expensive in capital or time, in favor of rapid experimentation and iteration of prototypes in-market. The temptation has intensified with the emergence of synthetic users and AI-generated personas that promise scalable simulation of human behavior for exploratory research (Anthis et al. 2025). Recent advances suggest LLMs can accurately model human cognition across diverse experimental contexts (Binz et al. 2025). These approaches offer apparent cost efficiencies by eliminating the complexities of recruiting and coordinating with actual participants. Yet synthetic substitutes come with different costs: they can hallucinate, flatten identity groups, and abstract away from the lived experiences that shape authentic human–technology interactions (Agnew et al. 2024; Wang, Morgenstern, and Dickerson 2025). Hoy and Van Hofwegen (2024) note that the promise of frictionless, automated research has led some product teams to treat ethnography as an obstacle to speed. Their work, like ours, argues that this moment clarifies rather than diminishes the value of ethnographic practice.

We demonstrate this through a study of Nova, a simulated AI family wellness assistant designed to help multiple family members collaboratively track wellness, set goals, and access human support through shared digital channels. We define wellness broadly as multidimensional well-being encompassing physical, emotional, social, and behavioral health (Wellness Alliance, n.d.), and recognize that family wellness emerges through relational dynamics and shared goals rather than individual metrics alone.

Family wellness represents a relational, high-stakes domain where automated systems must navigate ambiguity and build trust over time. These characteristics make it an ideal case for examining how ethnographic methods can engage directly with behavioral and social complexity. To manage the non-deterministic nature of LLMs, we engaged in longitudinal observation of natural interactions, creating opportunities for researchers to witness the human experience of these products both live and over time. Our findings point to a critical need to supplement rapid iteration with ethnographic research to deeply understand more than a simplistic jigsaw of product-market fit between human consumers and these emerging technologies. Crucially, we need to understand what humans are trying to accomplish, their preferences for interaction, and the needs and motivations behind their choices. The imperative becomes particularly urgent in domains characterized by ambiguity and high stakes—contexts where multiple approaches can address user needs and where outcomes depend heavily on cultural, relational, and temporal factors that emerge over extended engagement. Rather than simply documenting what families do or say, our approach treats ethnography as creative engagement with how AI-mediated wellness might unfold within the sensory and relational fabric of everyday family life. This approach aligns with principles of design ethnography, emphasizing learning and knowing with people through incremental modes of engagement that accumulate understanding over time rather than extracting insights through discrete observations (Pink 2021; Raats, Fors, and Ebbesson 2023).

Product development and testing once depended on products targeting the delivery of predictable experiences. Research teams could gather actionable insights from iterative small sample usability studies because products were designed to remain consistent across interactions (Dove et al. 2017). As Lew and Schumacher (2020) emphasize, this model assumes technology can be tested and refined in isolation from lived user experience. That paradigm breaks down when applied to AI systems, as LLMs generate infinite possible outcomes. Each interaction varies not just between users but within individual user experiences; dependent variables become independent ones. Yang et al. (2020) identify this as AI’s distinctive design challenge: adaptive capabilities and outputs that diverge at massive scale.

Family wellness deepens the complexity of human-AI interaction by embedding technically unpredictable systems within high-stakes, multi-participant settings shaped by social roles, care responsibilities, and emotional asymmetries (Brandtzaeg, Skjuve, and Følstad 2022; Thomas, Liu, and Umberson 2017; Wagner et al. 2025). Household dynamics are marked by persistent tensions between authority and dependency, independence and coordination, routine and disruption. Traditional usability methods often fail to account for these socio-relational conditions, which fundamentally influence how technology is adopted and adapted within families. Designing for domestic environments requires attention to distinct interactional demands (Sengers et al. 2005). Multi-party conversational systems must respond to shifting group dynamics, track interpersonal cues, and manage the conversational disorder that builds as multiple users interact in overlapping, nonlinear ways across time. LLMs continue to struggle with this complexity (Lee et al. 2025; Mao et al. 2024). Their technical limitations collide with the irreducible realities of family life, where individual wellness is entangled with collective well-being and the impact of technological mediation extends beyond any single user to the relational system as a whole.

Research in these settings requires methods that stay close to lived experience and account for the instability of AI behavior over time. Effective approaches must navigate the sociotechnical entanglement of AI systems with human values and everyday social environments (Johnson and Verdicchio 2025). Human-centered AI frameworks call for a contextual understanding of how people and AI systems interact (Hussain et al. 2024; Shneiderman 2020), while human-in-the-loop (HITL) methodologies offer ways to preserve agency in mediated decision-making (Mosqueira-Rey et al. 2023). Wizard-of-Oz prototyping supports this aim by enabling systematic observation before full implementation while still allowing for controlled experimentation (Kraus et al. 2024; Porcheron, Fischer, and Reeves 2021). In Wizard-of-Oz methodology, researchers simulate system functionality through human operators behind the scenes, allowing testing of user interactions before technical implementation.

Against this backdrop, we designed a study to explore how ethnographic methods could meet the demands of AI products built on non-deterministic systems. Our approach centered on three methodological innovations, each aligned to core product hypotheses and developed in response to the limitations of traditional usability methods. First, we created a digital hub to observe multi-participant interactions with the simulated AI assistant we code-named Nova. Second, we enabled relationship formation between families and the AI assistant through sustained engagement over time. Third, because the planned product called for the service to pass users to a human expert when the LLM reached the boundary of its usefulness, we integrated HITL support to the simulated AI conversations to examine how and when human expertise should complement automated systems. In the following sections, we outline the full study design and examine how each of these methodological choices surfaced insights that would have been missed using conventional techniques. This approach reflects a commitment to studying practice as it unfolds, not as it is hypothesized, what Ingold (2013) describes as attention to the reflexive, relational, and temporal processes through which meaning and behavior take shape.

Study Design

The study utilized a mixed-methods, phased approach, at the heart of which was a simulated prototype of a new family wellness assistant product designed for use by multiple members of a family. Nova was simulated through Wizard-of-Oz methodology, where researchers operated behind the scenes to generate AI-like responses rather than using actual AI technology. At the time of the study launch, the product was yet to be functionally prototyped by the design and engineering team.

Table 1. Study Methodology Overview

Phase | Method | Sample | Duration/Format
Baseline | In-depth interviews with artifact sharing | 10 families (9 completed), 2 parents per household (1 single parent) | 90-minute sessions
Longitudinal | Simulated AI interaction with embedded experiments | Same 9 families | 4 weeks, weekly 1-hour sessions plus ongoing Slack interaction
Validation | Quantitative survey | 100 dyads (two-person units; separate sample) | Online survey ranking wellness areas and activities

For the qualitative work, we recruited ten families; with the exception of one single-parent household, each family was represented in the study by two parents living in the same household. Nine of the ten families completed the study. Families were screened for household income above $125,000, children aged six to ten, and family size, including other children above the age of ten or other dependents who lived with them (e.g., parents, siblings). An additional 100 dyads, or two-person units (separate from the qualitative sample), responded to a quantitative survey designed to increase confidence in the qualitative findings.

Phase 1: Establishing Family Wellness Baselines

Prior to baseline interviews, participants shared digital captures of a week’s worth of calendar entries, to-do lists, planner pages and the other scheduling/organizational systems they used to manage their lives. These photographed and screenshotted artifacts ranged from digital calendars to color-coded checklists, multiple notebooks, wall calendars, and grocery tracking lists on fridges.

During the initial ninety-minute in-depth interviews, researchers guided each family dyad to reflect on the artifacts they had shared, how well their current support systems worked, and opportunities they saw for a premium support service to address specific, deeper, underlying needs. Each participant self-scored across six dimensions of wellness (Relationship, Parenting, Self, Social, Recreation/Fun, and Finances) on a one-to-seven scale, provided context for their baseline scores, and described effective support they envisioned for overcoming challenges in their identified problem areas. Researchers probed how participants ranked and prioritized dimensions, allowing family goals and concerns to surface. The researchers then worked with each participant to establish concrete goals in two areas where they felt they could move the needle by the end of the study.
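For illustration, the weekly self-scoring instrument described above can be represented as a simple validated record. The six dimension names and the one-to-seven scale come from the study; the data structure and validation logic below are our own assumptions, not the team's actual instruments.

```python
# Hypothetical sketch of the six-dimension wellness scorecard.
# Dimension names follow the paper; everything else is assumed.
DIMENSIONS = (
    "Relationship", "Parenting", "Self",
    "Social", "Recreation/Fun", "Finances",
)

def validate_scorecard(scores: dict) -> dict:
    """Check that a self-assessment covers all six dimensions on a 1-7 scale."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for dim, val in scores.items():
        if not (isinstance(val, int) and 1 <= val <= 7):
            raise ValueError(f"{dim}: score must be an integer from 1 to 7")
    return scores
```

Collecting such records weekly is what makes the trend-tracking described later in the study possible.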

Phase 2: Four-Week Longitudinal Interaction Protocol

The research team guided each of the families through four weeks of longitudinal research designed to study how the simulated AI with complementary human-expert (i.e., HITL) service could support each family’s individual and collective wellness journeys. Families engaged with Nova, the simulated AI assistant, through dedicated Slack channels over four weeks. The simulation employed a novel three-researcher mediation protocol that enabled real-time response generation while maintaining conversation quality and participant safety (detailed in the following section). Families were instructed to use their Slack channel as a comprehensive dashboard, leaving to-do items, partner communications, and other family information in the main chat without using threaded responses, enabling Nova to reference accumulated information during interactions.

In weekly one-hour sessions, participant dyads interacted with Nova, beginning with check-ins on items left in the chat and probing family wellness areas established during baseline interviews. During these sessions, participants were offered various HITL capabilities so that optimal attributes of human support could be discovered. We tested two key questions across four weeks: First, should human support be offered automatically or only when families ask for it (weeks 1–2)? Second, what type of human support works better: professional wellness coaches or peer supporters from the community (weeks 3–4)?

Figure 1. Four-week crossover design testing introduction methods (weeks 1–2) and expert types (weeks 3–4). Families 1–5 experienced Condition A (participant asks for HITL), Condition B (service proactively offers), Condition C (well-being coach), and Condition D (community member) in that order; Families 6–10 experienced each weekly pair in reverse order. Each family thus experienced all conditions, enabling direct comparison while controlling for individual differences.
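The counterbalanced schedule in Figure 1 can be sketched in a few lines of code. This is an illustrative reconstruction of the assignment logic, assuming the condition labels and family groupings shown in the figure; it is not tooling from the study, and the function name is ours.

```python
# Condition labels as described in Figure 1.
CONDITIONS = {
    "A": "Participant asks for HITL",
    "B": "Service proactively offers HITL",
    "C": "Well-being coach",
    "D": "Community member",
}

def crossover_schedule(family_id: int) -> list[str]:
    """Return the week 1-4 condition sequence for a family.

    Families 1-5 follow A, B, C, D; families 6-10 follow B, A, D, C,
    so each pair of weeks is counterbalanced across the two groups.
    """
    group = 0 if family_id <= 5 else 1
    schedule = []
    for first, second in [("A", "B"), ("C", "D")]:
        schedule += [first, second] if group == 0 else [second, first]
    return schedule
```

Counterbalancing the order within each pair of weeks is what lets order effects be separated from condition effects despite the small sample.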

Weekly wellness scorecards captured participants’ ongoing self-assessment of individual and family wellness, and week-to-week satisfaction with the family assistant service. Midpoint and final interviews allowed each dyad to reflect on their experience. In the final interview, participants provided feedback on prototyped product features developed from study insights and reassessed their wellness scores to measure goal achievement.

Phase 3: Quantitative Validation at Scale

One hundred dyads completed surveys ranking family wellness areas and wellness activities to test the validity of taxonomies that emerged during the qualitative research, with the aim of providing statistical confidence in our ethnographic findings.

Overall, the mixed methods, multi-phase study design enabled systematic study of phenomena invisible to traditional usability methods: relationship formation with AI over time, multi-participant interaction complexity, and optimal human-AI collaboration patterns in non-deterministic systems. The methodological innovations that made this study possible are detailed in the following section.

Methodological Innovation: Findings Followed Three Core Design Choices

Our study design addressed head-on the stochastic nature of LLMs and associated products. As a result, we surfaced findings that would simply have been missed had we relied on typical methods, and that would have arrived too late had we waited for technology to advance sufficiently to support our product direction. Specifically, non-deterministic outputs mean that the product experience changes not only between subjects but between sessions with the same subject; session-based testing therefore provides only a single, imperfect window into the true user experience and delivers data that cannot be extrapolated even for the same user, let alone across users. Moreover, waiting for usage data after the product ships carries the risk of discovering too late that longitudinal or unique user experiences have unintended, unanticipated consequences, because the almost infinite possible combinations of experiences had not been properly imagined, catalogued, or deliberately designed.

At the heart of our design was a low-cost, easily replicable simulation method that enabled systematic study of non-deterministic AI interactions. We understood that the weekly interactions with the family assistant service could, theoretically, have been built or functionally prototyped for testing. However, the team decided that simulation would yield more relevant insights, more efficiently, for three primary reasons, each of which: (1) mapped directly to hypothesized primary features of the planned product; (2) was in turn reflected in the research design; and (3) ultimately yielded findings that would otherwise have been missed.

The following sections detail our three core methodological choices: designing a digital hub for multi-participant AI interaction, enabling relational experience development over time, and integrating HITL support. For each choice, we explain the methodological challenge it addressed, the ethnographic innovation it required, and the unique insights it revealed that traditional usability methods could not have captured.

Digital Hub: Studying Multi-Participant AI Interactions

Methodological Challenge

In the planned family assistant product, users would work collaboratively on individual and collective wellness in a single digital hub where family members would interact with each other, an LLM assistant, and a human expert (as a HITL), serving one of three roles, detailed below. While there are many digital hubs where humans collaborate together, such as Slack, Discord, and others, none of them have yet integrated an omnipresent AI / LLM assistant. Even today, 1:many conversations remain technically challenging for LLMs, which struggle to manage shifting participants, overlapping inputs, and topic changes without error. As described in the Study Design section, we used Slack to simulate the product’s conversational interface. Families accessed their Slack channels via a mix of devices (e.g., laptops, desktops, and mobile phones) during their one-hour allotted time slot and throughout the day depending on availability and context.

Ethnographic Innovation

Our ethnographic approach involved sustained participant observation within a naturalistic digital environment over multiple weeks. We used Slack as a cost-effective, readily available canvas for the simulated AI interactions, with each family assigned a dedicated channel to communicate in a group chat with each other and with the service. Critically, we did not allow threading in these Slack-based simulations for two ethnographic reasons. First, this restriction mirrored the limitations of LLM-based chat applications at the time of the study, preserving ecological validity. Second, by eliminating threading, we reduced the possibility of the researchers mediating the simulation overlooking a participant request.

Findings

The Wizard-of-Oz simulation and extended engagement created space for families’ natural communication patterns to emerge. Through authentic observation, rather than controlled task performance, we discovered that multi-participant AI interactions produced a kind of organizational chaos: overlapping inputs, competing topics, out-of-sequence replies, and conversational drift that compounded over time. This insight led directly to patentable innovations that would have been impossible to identify through controlled testing scenarios.

A well-organized, easily navigable chat experience quickly became important—and underscored a key product design challenge. During the early interactions with the simulated service, we observed families experimenting with Nova’s capabilities, often throwing a wide variety of requests at the service to ascertain the specific usefulness of the product. This echoed findings from earlier research this team had done around other subscription-based assistant services.

While we describe this behavior as experimentation, for most families it was not ordered, but instead often bordered on chaotic. In one family, topics that were discussed in a twenty-minute time span ranged from searching for a new home to planning activities for an upcoming vacation to potty training, with each parent leading and weighing in on different topics at different times.

Between the families’ energetic, group-based exploration of service capabilities, and the “flat,” unthreaded nature of the simulation space, the conversations between families and Nova became noisy, quickly. Participants often “spoke over each other” (i.e., typed simultaneously, sometimes on competing topics) and topics swelled and contracted, sometimes assuming their own hierarchical structure. For example, as a family discussed vacation activities, the conversation necessarily shifted to include weather forecasts, dining recommendations, and local customs, including slang. Some subtopics were truly ephemeral in nature, only relevant to the vacation at hand; others, however, contained values or preferences that were more permanent in nature, like food preferences, that families could reasonably expect the service to recall for future interactions.

Further, while we detail below how participants became increasingly comfortable with the service over time, time also had an opposite effect on the navigability of the chats themselves. Within and across the scheduled sessions, as each dyad added new topics while referring back to older topics, the conversations became increasingly difficult to track. It became clear that the digital hub within the actual product would need to not only accommodate LLMs as well as humans, but would also need to provide organization and transparency to mediate the chaotic nature of ongoing multi-participant chat.

Our study’s repeated identification of the overwhelming complexity and “chaos” of family chats with a simulated LLM, even in a synchronous capacity, revealed a crucial gap in how digital communication interfaces were designed: these interfaces were modeled after 1:1 simultaneous conversations. Put differently, they were not designed to truly support how humans communicate asynchronously, and certainly not how multiple humans communicate in a single channel, asynchronously.

After our study closed, this insight, and the associated chat logs where families traversed multiple unrelated topics inside mere minutes, were brought to a brainstorming session with the engineering team alongside the provocation: how might we allow families to communicate and collaborate naturally, asynchronously over time? Based on the feedback received in the studies, the engineering team quickly created a patentable technology, “dynamic topic segmentation,” that empowered families to speak naturally while it automatically organized topics and next steps on the back end. To that end, the research and design team developed a digital lo-fi prototype manifestation of how dynamic topic segmentation would look and feel, known as Chatmarks.
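The paper does not describe the patented implementation of dynamic topic segmentation. As a rough sketch of the general idea only, assuming a naive vocabulary-overlap heuristic of our own invention, the fragment below splits an unthreaded message stream into consecutive topic segments; a production system would presumably use semantic embeddings and handle interleaved, resumable topics.

```python
def jaccard(a: set, b: set) -> float:
    """Vocabulary overlap between two word sets (0.0 to 1.0)."""
    return len(a & b) / len(a | b) if a | b else 0.0

def segment_topics(messages: list[str], threshold: float = 0.2) -> list[list[str]]:
    """Group consecutive messages into topic segments.

    A new segment starts when a message shares too little vocabulary
    with the running segment. Illustrative only: real segmentation
    would need to track topics that pause and resume out of order.
    """
    segments: list[list[str]] = []
    current: list[str] = []
    vocab: set = set()
    for msg in messages:
        words = set(msg.lower().split())
        if current and jaccard(words, vocab) < threshold:
            segments.append(current)
            current, vocab = [], set()
        current.append(msg)
        vocab |= words
    if current:
        segments.append(current)
    return segments
```

On a toy stream, a two-message vacation-planning exchange and a potty-training message would land in separate segments even though all three arrived in one flat channel.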

Because the participants had developed rich familiarity and situational awareness not only with the Nova platform but also with their own conversations, their feedback enabled us to develop a lo-fi way of validating our solutions. The prototypes were customized with topics for each family and subsequently tested with the participants. Participants were palpably relieved, even joyful, as they witnessed how the prototypes provided a fundamental and needed improvement to the conversations they’d experienced.

With strong encouragement that we could address the fundamental needs shown by our study, the solutions were approved for patenting. Chatmarks, dynamic topic segmentation, and several other related concepts were immediately fast tracked for patent approval, and in total, four related patent applications were filed on behalf of this team. The method of developing semi-custom lo-fi prototypes based on chat interactions has been reused by the team in subsequent studies and continues to be a reliable method for examining future interaction patterns in stochastic products.

Relational Experience: Observing AI Relationship Formation Over Time

Methodological Challenge

From past work the team had done together, we hypothesized that relationships would begin to form between the families and the simulated LLM in the chat. We expected that the formation of these relationships would take time, as moving from trial to committed engagement with any service requires multiple satisfying interactions. We also anticipated these relationships would add business value, as interactions that were less transactional and more relational could help users move into higher value areas of the service.

We also recognized that even a single misstep on the part of an (inherently unpredictable) bot could do irreparable damage to its relationship with an individual participant or their family writ large. These factors, combined with the aforementioned inability of any existent LLM to engage in the multiplayer chats required by the product design, pointed clearly to a decision to simulate the bot’s presence in our Digital Hub, and to do so over a series of interactions, so that relationships would in fact have time to form and useful patterns could be observed.

Ethnographic Innovation

The responses that participants received as they chatted with “Nova, an AI assistant” were actually created by three researchers role-playing in Wizard-of-Oz fashion in Slack. Each time a request from a participant was received, Researcher 1 made a relevant query of ChatGPT and/or Gemini; Researcher 2 quickly edited and formatted the AI’s response; and Researcher 3 tailored the edited response to reflect previous chats with this participant, as well as relevant participant-specific data gleaned during the baseline phase of the research, and then responded to the participant in Slack as Nova.

Through practice and repetition, these responses were created and posted in about a minute’s time, with multiple requests often in play, but to maintain a credible simulation, participants were asked to be patient with an evolving prototype that would be slower to respond than other AI they had likely encountered. This approach enabled ethnographic observation of relationship formation processes that unfold gradually, something that would be difficult to capture in single-session studies.
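The three-researcher relay can be summarized as a pipeline. In the sketch below every function name is hypothetical and every step is a stand-in for work a human researcher performed live; the code simply makes the division of labor explicit.

```python
def query_llm(request: str) -> str:
    """Researcher 1's step: query a general-purpose LLM (stubbed here)."""
    return f"[LLM draft answering: {request}]"

def edit_and_format(draft: str) -> str:
    """Researcher 2's step: trim and format the raw model output."""
    return draft.strip()

def personalize(text: str, profile: dict, history: list[str]) -> str:
    """Researcher 3's step: weave in participant-specific context."""
    name = profile.get("name", "there")
    return f"Hi {name}, {text}"

def nova_relay(request: str, profile: dict, history: list[str]) -> str:
    """Run one participant request through the three-step pipeline."""
    return personalize(edit_and_format(query_llm(request)), profile, history)
```

Seen this way, the protocol is a human-powered version of the retrieve-edit-personalize loop a production assistant would automate, which is why responses could be kept consistent while still reflecting each family's history.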

Findings

By observing families over weeks rather than hours, we discovered that AI relationships create space for self-reflection and behavior change that compounds over time, sparking positive change and influencing product design. Traditional usability testing would have missed both the relationship formation process and its downstream effects on family wellness practices.

Despite the usability challenges of the synchronous multiplayer conversations, we still observed families form relationships with Nova, the simulated AI agent. As we hypothesized, time was required and the study’s longitudinal approach mattered: by the end of the second week of interactions, once families had successfully mapped Nova’s capabilities, they began to connect with the “bot,” as evidenced by a series of behavior changes including, but not limited to: referring to Nova with gendered pronouns; making jokes with Nova; comparing and contrasting relationships with Nova between themselves and their partner; and asking Nova questions about itself.

These behaviors align with research on “parasocial relationships” (Noor, Rao Hill, and Troshani 2021) noting how people can form one-sided attachments with people or characters. In fact, some participants built such an identifiable relationship with Nova that when, in week 4 of the study, the research team switched out Nova’s human simulators due to scheduling conflicts, participants not only noticed but were concerned. Some pointed out that Nova’s interaction style had noticeably changed. The emerging relationships we observed also raise questions about how users project onto and interpret AI agents over time. While outside the scope of this study, future work could draw on phenomenology and psychoanalytic anthropology (e.g., Ihde 1990; Fuchs and De Jaegher 2009; Fischer 2009) to explore how people mediate self–other dynamics in long-term AI interaction.

As the study progressed, we also saw how opportunities afforded by the study design’s combination of time, carefully simulated AI interactions through three-researcher Wizard-of-Oz simulation, and human support consisting of professional coaches and role-playing members of the research team allowed the team to surface the power of the AI-human relationships to stimulate reflection.

During the baseline interviews, we’d learned about two major obstacles each family faced: The first obstacle was one of mindset: all participants saw wellness as a zero-sum game. We heard that working to improve any aspect of their lives necessarily required them to detract from another area. This left families feeling stuck in a fixed mindset about their own ability to make themselves feel better.

The second obstacle was behavioral: lack of partner discussions on wellness had led to stagnation. In their typical day-to-day life, the families weren’t creating time and space to have discussions about their family and individual well-being. As a result, there was a missed opportunity to apply the collective intelligence of the partners to the challenge of getting unstuck.

Throughout the study, families responded to unique opportunities to identify and communicate both mundane and higher-order tasks and needs:

  • The facilitated conversations during baseline allowed participants to identify gaps in knowledge about their partner’s current state and goals, and to reflect on what was possible.

  • Across the four weeks of longitudinal work, participants found that completing weekly scorecard self-assessments allowed them to notice trends and track changes.

  • Revealing partners’ scores in the initial and midpoint interviews allowed participants to assess family wellness with each other, and sparked conversations about goals and needs.

  • As the couples engaged in the Slack chats with the simulated AI, participants were able to learn about the high and low points of their partners’ wellness, and about support that was needed, but had not been communicated directly.

  • As detailed below, the HITL sessions provided invaluable reinforcement, helping participants maintain their commitment to reflection and their specific wellness goals.

For participants, the sum result of these factors was observable progress toward their identified goals (e.g., quality family time, self-improvement). Participants reported positive change across multiple dimensions of well-being and attributed the success to the reflection, the “AI,” and the HITL sessions. Significantly, the zero-sum mindset they exhibited at the beginning of the study proved changeable: over the four weeks of the study, when individual wellness scores improved, family wellness scores improved commensurately.

In addition to validating the need for thoughtfully orchestrated simulation within the research design, these findings on the power of reflection directly impacted product design. “Guided Weekly Reflection” became a key feature built into the functioning product, supporting family efforts and activating individual as well as collective intelligence to pursue well-being.

Human(s) in the Loop: Identifying Optimal AI-Human Collaboration Patterns

Methodological Challenge

As noted earlier, the planned product called for the service to offer both LLM and human support. Simulating the chat experience allowed the research team to determine where the boundaries between AI and human support should lie, and when and how often transitions to human agents should occur. Specifically, the combination of simulation and time allowed the team to answer three key questions:

  1. How should humans be introduced?
    As one of two experiment conditions, families were either able to request or were offered human support during each of their Nova sessions.

  2. What kind of human should be introduced?
    As a second experiment condition, families were exposed to three types of human support: Generalists, who were introduced to participants with little to no context as to who they were or what exactly they could do; well-being coaches, who were introduced to participants as professionally experienced in providing families with wellness support; and community members, who were introduced to participants as having relevant points in common with them (e.g., parenting, similar cultural backgrounds).

  3. How much time would the human need to create impact?
    Throughout the study, all interactions between participants and the various human supporters were time-tracked.
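The counterbalancing behind these two experiment conditions can be sketched in code. This is purely illustrative: the condition names follow the study description above, but the exact assignment scheme (alternating introduction order by family, rotating expert-type order) is an assumption made for the sketch, not the study’s documented protocol.

```python
# Illustrative sketch only: condition names follow the study description
# (introduction method, then expert type), but this exact counterbalancing
# scheme is an assumption for illustration.

INTRO_METHODS = ["request", "offer"]                 # how humans are introduced
EXPERT_TYPES = ["generalist", "coach", "community"]  # what kind of human

def schedule(family_index):
    """Return one family's conditions, alternating introduction order and
    rotating expert-type order across families to control for sequence
    effects."""
    if family_index % 2 == 0:
        intro = INTRO_METHODS
    else:
        intro = list(reversed(INTRO_METHODS))
    # rotate expert types so each family meets them in a different order
    r = family_index % len(EXPERT_TYPES)
    experts = EXPERT_TYPES[r:] + EXPERT_TYPES[:r]
    return {
        "week1_intro": intro[0],
        "week2_intro": intro[1],
        "expert_order": experts,
    }
```

With a scheme like this, every family experiences every condition, but no two adjacent families experience them in the same order, which is the logic a crossover design relies on to separate condition effects from sequence effects.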

Ethnographic Innovation

Our ethnographic approach enabled systematic observation of spontaneous preferences for human versus AI support as they emerged naturally during family interactions. Rather than asking families hypothetically about their preferences, we observed their actual behavior when offered different types of human assistance during authentic problem-solving moments.

Findings

Through observing families’ actual behavior rather than stated preferences, we discovered that proactive human support was a valued differentiator and even minimal interaction time increased perceived value. Traditional survey or interview methods would have missed the gap between what families say they want and what they actually value when experiencing the service.

Here, too, our study design allowed us to surface findings that would otherwise have been missed and that applied across the three HITL types, regardless of whether human support was requested or offered. Families were reluctant to request human support: eight out of nine families did not ask for a human unprompted. Yet they said they valued speaking to a human when the option was offered, and in seven of nine instances families agreed to meet with a human when prompted, reading the offer as a signal that the “AI” recognized the boundaries of its capabilities.

When participants preferred human support over AI, they consistently pointed to the value of lived experience. In total, we observed six advantages that participants associated with human interaction:

  1. Credible opinions. Families were interested in a human expert’s (as a HITL) experience-backed preferences or recommendations such as neighborhood recommendations from someone who shared their background, ethnicity, or gender.

  2. Connection. Families sought a feeling of connection with others who had been through what they were experiencing and they highlighted for us the inability of AI to credibly appropriate emotions, physical conditions, and sensations.

  3. Acquired expertise. Families wanted to speak to humans who, either through education and credentials or simply repeat experience, knew their craft inside and out, clearly delineating a preference for how humans really do things over how they are supposed to be done.

  4. Faster to correctly comprehend. Because they had “been there,” humans were seen to understand requests better and at a faster rate than AI. Embodied experience, in other words, helped build relationships between families and the HITL and build confidence in the service.

  5. Higher-order decision support. Humans were seen to have a crucial role in decision-making. While users found AI to be a valuable tool in creating a shortlist of options, they ultimately felt that human expertise was necessary to evaluate selections and make informed decisions.

  6. Accountability for follow-through. Many families told us—unprompted—that digital nudges were too easy to ignore, while even brief conversations with a human created a greater sense of obligation. Half the families specifically volunteered that human support made them feel more accountable. These comments underscored why our study prioritized HITL design as a critical factor in effective behavior change.

These six themes capture the core advantages participants associated with human support. But beyond their stated preferences, we also observed subtle patterns in how human interaction shaped perceived value, emotional resonance, and behavior change, especially related to time, accountability, and decision-making.

Even though HITL sessions averaged less than fifteen minutes, participants told us the conversation length was “just right” and “not too short or too long for a busy day, while providing a lot of information.” In that short time speaking with a HITL, families noted finding value in the conversation, and no families indicated the need for a longer HITL interaction.

The HITL offering also increased the perceived value of the service. When asked about pricing, families reported a mean expected value of $82/month with HITL, compared to $37 without. This clear differential helped the business prioritize HITL integration on the roadmap, especially in a market where many of the current LLM-based AI offerings are available at little to no cost to the end consumer.

Discussion and Conclusion

Our study highlights patterns that deserve closer attention in future simulations like this one, particularly for understanding how and why to build AI products that support user understanding, accountable use, and responsible engagement, especially in contexts where emotional reliance or decision-making is involved. Given that LLMs and other AI technologies operate as black boxes, careful study is necessary before deploying these systems with real users. Ethnography helps clarify how users interpret and relate to opaque AI systems, and what kinds of human support make them feel trustworthy.

The Wizard-of-Oz metaphor proves useful here, and not only because of our methodological approach. Galli (2024) writes that the Wizard should not be dismissed as a fraud, but should be recognized as a guide who operates behind the curtain with care, domain expertise, and a commitment to serving others. In our study, researchers played this role: monitoring interactions, tailoring responses, tracking goals, and safeguarding the experience. Their labor was invisible but integral to the simulation’s quality.

Through extended observation of how families actually communicate, we saw how multi-participant AI interactions produced what we’ve characterized as organizational chaos: overlapping inputs, competing topics, out-of-sequence replies, and conversational drift that compounded over time. This finding led directly to patentable innovations that would have been difficult to identify through traditional methods.
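To make the design problem concrete, the organizational chaos above can be illustrated with a toy topic-segmentation sketch for multi-participant chat. To be clear, this is not the patented approach the study produced; it is a minimal lexical-overlap heuristic, and the threshold, helper functions, and sample chat are all hypothetical.

```python
# Toy sketch: assign each incoming message to the most lexically similar
# open thread, or start a new thread. NOT the patented method described
# in the paper; threshold and sample data are hypothetical.

def tokens(text):
    return set(text.lower().split())

def jaccard(a, b):
    return len(a & b) / len(a | b) if a | b else 0.0

def segment(messages, threshold=0.15):
    """Group (speaker, text) messages into topical threads."""
    threads = []  # each thread is a list of (speaker, text) tuples
    for speaker, text in messages:
        best, best_sim = None, 0.0
        for thread in threads:
            # compare against the thread's accumulated vocabulary
            vocab = set().union(*(tokens(t) for _, t in thread))
            sim = jaccard(tokens(text), vocab)
            if sim > best_sim:
                best, best_sim = thread, sim
        if best is not None and best_sim >= threshold:
            best.append((speaker, text))
        else:
            threads.append([(speaker, text)])
    return threads

# Hypothetical interleaved family chat: two topics arriving out of sequence.
chat = [
    ("mom", "can we plan the weekend hike"),
    ("dad", "did anyone schedule the dentist"),
    ("mom", "the hike trail near the lake looks good"),
    ("dad", "ok the dentist said tuesday works"),
]
threads = segment(chat)
```

Even this crude heuristic untangles the two interleaved topics in the sample chat; the study’s finding was that real family conversations compound this problem over time, which is why a dynamic, production-grade version of this capability became a product innovation.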

Our longitudinal approach also uncovered the complexity of relationship formation with AI over time. Families often attributed intelligence or intentionality to Nova even when responses were inconsistent. This interpretive generosity is not unique to our participants; it reflects a broader pattern in which conversational AI systems are treated as reliable partners or confidants, even when their underlying mechanisms remain opaque. Recent reporting has documented instances of users becoming emotionally entangled with chatbots, with outcomes ranging from distorted reality to harm (Tufekci 2025). Such risks are not readily observable without the kind of context that rich ethnography provides. Understanding when reliance becomes risky, and how trust is misplaced, demands methods attuned to context and evolving interpretation. Our study surfaced prosocial outcomes, but this same relational dynamic may carry different risks in other settings, including emotional dependency or habitual overuse.

While our primary focus in this paper has been methodological and design-centered, we note that participants’ interactions with the simulated AI often surfaced complex relational dynamics that merit further exploration. Specifically, the evolving relationships between participants and the LLM assistant revealed patterns of projection, trust attribution, and emotional resonance that stretch beyond conventional notions of “wellness.” These interactions suggest space for future inquiry into how users interpret, personify, or emotionally invest in AI systems over time.

We need not look far into the history of human factors research to find conversations about function allocation and the Fitts List, a foundational guide for determining which tasks are best handled by humans versus machines (de Winter and Dodou 2014). While arguments about function allocation have evolved over the past 70+ years, we observed a persistent and perhaps permanent need for humans in the loop, particularly as systems move into domains that are more emotional and more human in nature. Our finding that families wanted their identities and cultural differences recognized and matched with their HITL underscores added complexity in designing AI+HITL services. The AI’s non-deterministic behavior combines with the lack of a one-size-fits-most approach for the human component.

Our research also revealed a strong desire for proactivity from the assistant. For bots to meet this expectation, they will need to know users deeply, at times approaching or even equaling the user’s own self-knowledge. This points to a broader need for long-term ethnographic research that tracks how users interact with technology over time and across contexts.

While proactivity may be welcomed, decision support must also be studied ethnographically to understand how decisions can and should be made with agentic AI assistance. Technology may speed up decision-making, but some decisions, especially those involving others, require human evaluation and timing. Behavioral data may show that users prefer to act quickly with AI support, but only qualitative methods can reveal when slower, shared, or more reflective decisions are needed. We might look to adjacent domains like aviation, where pilot-automation dynamics in the glass cockpit offer clues for balancing speed and oversight (Sarter and Woods 1992). Recent ethnographic research on digital flight assistants, for example, examines how pilots conceptualize intelligent systems in highly automated environments. The study found that skepticism toward AI tools was often rooted in past frustrations with digital assistants, and that effective adoption depends on careful attention to environmental context, such as noise, cognitive load, and multi-user coordination (Gosper et al. 2021). These concerns echo many of our own findings around trust, role clarity, and shared use in family-based AI systems.

From a product perspective, then, designing for responsible engagement may mean rethinking metrics like daily and monthly active users (DAU and MAU) to better capture meaningful interaction. It may also involve incorporating “graceful exits” into bot design, adjusting default patterns to reduce unnecessary use, and creating guardrails for more sensitive contexts, including use by minors.

As progress in machines, LLMs, and AI continues apace, ethnography will be a key ingredient in identifying why certain tasks and interactions should remain partially or exclusively human in nature. Families were keen to ensure that machines stayed “in their lane” when it came to discussing lived experiences. Going forward, the design of AI-mediated services should make room for human insight shaped by context and time. Such deliberate pacing aligns with “slow innovation” approaches that prioritize responsible development at the micro-level of organizing projects (Steen 2021). Those who bring contextual understanding to how systems behave and are understood, including ethnographers, care professionals, and other researchers, should not be relegated to the background. Some decisions and relationships cannot be fully automated. In domains like wellness and caregiving, the goal is not full automation but better support. To get there, we need to study not only what systems can do but what people actually need.

References

Agnew, W., A. S. Bergman, J. Chien, M. Díaz, S. El-Sayed, J. Pittman, S. Mohamed, and K. R. McKee. 2024. “The Illusion of Artificial Inclusion.” In Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems (CHI ’24), 1–12. New York, NY, USA: ACM. https://doi.org/10.1145/3613904.3642703.
Anthis, J. R., R. Liu, S. M. Richardson, A. C. Kozlowski, B. Koch, J. Evans, E. Brynjolfsson, and M. Bernstein. 2025. “LLM Social Simulations Are a Promising Research Method.” In Proceedings of the 42nd International Conference on Machine Learning, Vancouver, Canada. PMLR 267. https://doi.org/10.48550/arXiv.2504.02234.
Binz, M., E. Akata, M. Bethge, F. Brändle, F. Callaway, J. Coda-Forno, P. Dayan, et al. 2025. “A Foundation Model to Predict and Capture Human Cognition.” Nature, 1–8. https://doi.org/10.1038/s41586-025-09215-4.
Brandtzaeg, P., M. Skjuve, and A. Følstad. 2022. “My AI Friend: How Users of a Social Chatbot Understand Their Human–AI Friendship.” Human Communication Research 48 (3): 404–29. https://doi.org/10.1093/hcr/hqac008.
Dove, G., K. Halskov, J. Forlizzi, and J. Zimmerman. 2017. “UX Design Innovation: Challenges for Working with Machine Learning as a Design Material.” Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems, 278–88. https://doi.org/10.1145/3025453.3025739.
Fischer, M. M. 2009. Anthropological Futures. Duke University Press.
Fuchs, T., and H. De Jaegher. 2009. “Enactive Intersubjectivity: Participatory Sense-Making and Mutual Incorporation.” Phenomenology and the Cognitive Sciences 8 (4): 465–86. https://doi.org/10.1007/s11097-009-9136-4.
Galli, G. 2024. “Peeling Back the Curtain to Unmask the Wizard of AI: Considering the Collaborative Relationship between Non-Technical Subject Matter Experts and Artificial Intelligence,” January. https://doi.org/10.2139/ssrn.4694869.
Gosper, S., J. R. Trippas, H. Richards, F. Allison, C. Sear, S. Khorasani, and F. Mattioli. 2021. “Understanding the Utility of Digital Flight Assistants: A Preliminary Analysis.” In Proceedings of the 3rd Conference on Conversational User Interfaces, 1–5. ACM. https://doi.org/10.1145/3469595.3469627.
Hoy, T., and J. Van Hofwegen. 2024. “Experts in the Loop: Why Humans Will Not Be Displaced by Machines When There Is ‘No Right Answer.’” Ethnographic Praxis in Industry Conference Proceedings 2024 (1): 196–224. https://doi.org/10.1111/epic.12203.
Hussain, M., I. Iacovides, T. Lawton, V. Sharma, Z. Porter, A. Cunningham, I. Habli, et al. 2024. “Development and Translation of Human-AI Interaction Models into Working Prototypes for Clinical Decision-Making.” Designing Interactive Systems Conference, 1607–19. https://doi.org/10.1145/3643834.3660697.
Ihde, D. 1990. Technology and the Lifeworld: From Garden to Earth. Indiana University Press. https://doi.org/10.2979/3108.0.
Ingold, T. 2013. Making: Anthropology, Archaeology, Art and Architecture. Routledge. https://doi.org/10.4324/9780203559055.
Johnson, D. G., and M. Verdicchio. 2025. “The Sociotechnical Entanglement of AI and Values.” AI & SOCIETY 40 (1): 67–76. https://doi.org/10.1007/s00146-023-01852-5.
Kraus, M., S. Klein, N. Wagner, W. Minker, and E. André. 2024. “A Pilot Study on Multi-Party Conversation Strategies for Group Recommendations.” ACM Conversational User Interfaces 2024, 1–7. https://doi.org/10.1145/3640794.3665569.
Lee, S., M. Kim, S. Hwang, D. Kim, and K. Lee. 2025. “Amplifying Minority Voices: AI-Mediated Devil’s Advocate System for Inclusive Group Decision-Making.” ACM. https://doi.org/10.1145/3708557.3716334.
Lew, G. S., and R. M. Schumacher. 2020. AI and UX: Why Artificial Intelligence Needs User Experience. Apress Publishers, a division of Springer-Nature. https://doi.org/10.1007/978-1-4842-5775-3.
Mao, M., P. Ting, Y. Xiang, M. Xu, J. Chen, and J. Lin. 2024. “Multi-User Chat Assistant (MUCA): A Framework Using LLMs to Facilitate Group Conversations.” https://doi.org/10.48550/arXiv.2401.04883.
Mosqueira-Rey, E., E. Hernández-Pereira, D. Alonso-Ríos, J. Bobes-Bascarán, and Á. Fernández-Leal. 2023. “Human-in-the-Loop Machine Learning: A State of the Art.” Artificial Intelligence Review 56 (4): 3005–54. https://doi.org/10.1007/s10462-022-10246-w.
Noor, N., S. Rao Hill, and I. Troshani. 2021. “Artificial Intelligence Service Agents: Role of Parasocial Relationship.” Journal of Computer Information Systems 62 (5): 1009–23. https://doi.org/10.1080/08874417.2021.1962213.
Pink, S. 2021. Doing Visual Ethnography. 4th ed. Sage.
Porcheron, M., J. E. Fischer, and S. Reeves. 2021. “Pulling Back the Curtain on the Wizards of Oz.” Proceedings of the ACM on Human-Computer Interaction 4 (CSCW3): 1–22. https://doi.org/10.1145/3432942.
Raats, K., V. Fors, and E. Ebbesson. 2023. “Tailoring Co-Creation for Responsible Innovation: A Design Ethnographic Approach.” In 14th Scandinavian Conference on Information Systems. Vol. 15. https://aisel.aisnet.org/scis2023/15.
Sarter, N. B., and D. D. Woods. 1992. “Pilot Interaction with Cockpit Automation: Operational Experiences with the Flight Management System.” The International Journal of Aviation Psychology 2 (4): 303–21. https://doi.org/10.1207/s15327108ijap0204_5.
Sengers, P., K. Boehner, S. David, and J. Kaye. 2005. “Reflective Design.” In Proceedings of the 4th Decennial Conference on Critical Computing: Between Sense and Sensibility, 49–58. https://doi.org/10.1145/1094562.1094569.
Shneiderman, B. 2020. “Bridging the Gap between Ethics and Practice: Guidelines for Reliable, Safe, and Trustworthy Human-Centered AI Systems.” ACM Transactions on Interactive Intelligent Systems (TiiS) 10 (4). https://doi.org/10.1145/3419764.
Steen, M. 2021. “Slow Innovation: The Need for Reflexivity in Responsible Innovation (RI).” Journal of Responsible Innovation 8 (2): 254–60. https://doi.org/10.1080/23299460.2021.1904346.
Thomas, P. A., H. Liu, and D. Umberson. 2017. “Family Relationships and Well-Being.” Innovation in Aging 1 (3): igx025. https://doi.org/10.1093/geroni/igx025.
Tufekci, Z. 2025. “Musk’s Chatbot Started Spouting Nazi Propaganda. That’s Not the Scariest Part.” The New York Times, July 11, 2025. https://www.nytimes.com/2025/07/11/opinion/ai-grok-x-llm.html.
Wagner, N., M. Kraus, W. Minker, D. Griol, and Z. Callejas. 2025. “A Survey on Multi-User Conversational Interfaces.” Applied Sciences 15 (13): 7267. https://doi.org/10.3390/app15137267.
Wang, A., J. Morgenstern, and J. P. Dickerson. 2025. “Large Language Models That Replace Human Participants Can Harmfully Misportray and Flatten Identity Groups.” Nature Machine Intelligence 7 (3): 400–411. https://doi.org/10.1038/s42256-025-00986-z.
Wellness Alliance. n.d. “The Six Dimensions of Wellness.” https://www.wellnessalliance.org/resources-and-tools/nwis-six-dimensions-of-wellness.
Winter, J. C. F. de, and D. Dodou. 2014. “Why the Fitts List Has Persisted throughout the History of Function Allocation.” Cognition, Technology & Work 16 (1): 1–11. https://doi.org/10.1007/s10111-011-0188-1.
Yang, Q., A. Steinfeld, C. Rosé, and J. Zimmerman. 2020. “Re-Examining Whether, Why, and How Human-AI Interaction Is Uniquely Difficult to Design.” Proceedings of the 2020 CHI Conference on Human Factors in Computing Systems, 1–13. https://doi.org/10.1145/3313831.3376301.
