Developing Visual Conventions for Explainable LLM Outputs

PI: Dr. Matthew Peterson
Co-PI: Helen Armstrong
Research Assistants: Ashley Anderson, Kayla Rondinelli, Rebecca Planchart, and Kweku Baidoo

The NC State Laboratory for Analytic Sciences team, led by TPM Christine B.

RESEARCH OBJECTIVE: Discover innovations in interface design that empower intelligence analysts to efficiently validate large language model (LLM) outputs.

Interact with the demo and read more about our research.

As we consider a future of human/machine teaming, we seek an interface environment through which humans can augment their cognitive abilities with AI, combining human insight with AI support to leverage the unique strengths of both.

In our project “Developing Visual Conventions for Explainable LLM Outputs in Intelligence Analysis Summaries” (Conventions), our research team positioned human sensemaking in this complementary human/AI space.

When users cannot fully grasp how automated systems work, their willingness to rely on them depends heavily on trust, particularly in complex scenarios where comprehensive understanding of an AI system is impractical. The literature establishes trust as a crucial factor in human-machine teaming with AI. Humans tend to trust these systems either too much (more than their capabilities warrant) or too little (dismissing them quickly, even when the results are accurate). Appropriate trust calibration via user interface design offers one possible answer.

The Conventions project considered how uncertainty might be effectively communicated to analysts through visual conventions. We then embedded these visual conventions inside a larger sensemaking interface that explores several methods of trust calibration, including verification, multi-agent dialog, context, and user oversight. We began with an interactive demo (MAVs) that uses the common interface convention of a dashboard, then created three future-facing speculative interfaces to explore how AI capabilities might support appropriate trust calibration between an intelligence analyst and an LLM.

This material is based upon work done, in whole or in part, in coordination with the Department of Defense (DoD). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the DoD and/or any agency or entity of the United States Government.


A grayscale, dashboard-like interface displaying the summary output of an LLM.


In the first of these future-facing speculative interfaces, we considered verification: how might the interface use query recommendations, nudging, verification, and source investigation to support trust calibration?

In the second speculative interface, we dug into the concept of multi-agent dialog, designing the interaction as a conversational user interface that engages the human user with multiple agents. Envisioning different functionality in the form of distinct agents helps users form a clear mental model of the AI system’s capabilities.
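To illustrate this design idea, the sketch below shows one way distinct agents might be represented in code. The agent names, descriptions, and routing logic are hypothetical illustrations, not the project’s actual implementation.

from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    function: str

# Hypothetical agents; each embodies one distinct capability of the system.
AGENTS = [
    Agent("Summarizer", "drafts the summary from retrieved reporting"),
    Agent("Verifier", "checks claims in the draft against cited sources"),
    Agent("Uncertainty Reviewer", "flags spans of the draft that may be unreliable"),
]

def route(message: str) -> Agent:
    """Toy router: choose which agent should respond to the analyst's message."""
    text = message.lower()
    if "verify" in text or "source" in text:
        return AGENTS[1]
    if "uncertain" in text or "confident" in text:
        return AGENTS[2]
    return AGENTS[0]

print(route("How confident are you in the second paragraph?").name)

Giving each function a named agent, rather than a single undifferentiated assistant, is what lets the user map capabilities onto the system.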

In the third, we considered context. We explored how the interface might intuitively respond to the needs of specific users, customers and/or storylines to decrease cognitive load and tailor responses. This interface supports trust calibration by both providing more relevant data to the user and enabling the user to adjust the analytic and visualization sensitivities to match the context of the scenario.

Research Outcomes

Situating Uncertainty

The workflow involves the following steps (a code sketch of the pipeline follows the list):

Query Initiation: The analyst formulates a query with optional detailed parameters.
Retrieval Augmented Generation (RAG): RAG optimizes the large language model (LLM) output by directing it to the most relevant information.
Query Response Generation: The LLM generates a response based on the optimized query.
Initial Output: The LLM presents the generated results to the analyst.
Response Evaluation: A secondary LLM evaluates these results, assessing data uncertainty.
Visualization: The system displays visualizations of the evaluated results for the analyst to review.
Feedback Loop: The analyst provides feedback, enabling iterative refinement and interaction with the LLM.
Where does uncertainty reside? When is uncertainty irreducible? We focused on the uncertainty in the summary output itself rather than in the reality it represented. We identified five categories of uncertainty (a data-structure sketch follows the list):

Meaning Uncertainty: Misinterpreting word sense for technical, cultural, or uncommon terms, or for jargon.
Reference Uncertainty: Mistaking associations from demonstratives (“those”), adverbs (“there”), definite articles (“the”), or pronouns (“they”).
Conjecture Uncertainty: Jumping to conclusions, incorrectly completing partial information, or making assumptions.
Credibility Uncertainty: Trusting a statement that was unserious, humorous, incongruous, a non sequitur, a manipulation, or an apparent lie.
Evidence Uncertainty: Making a claim without supporting evidence, either drawing from opaque training data or by hallucination.
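These categories can be expressed as a small data structure so that an evaluating model’s findings attach to specific spans of a summary. The names and fields below are illustrative assumptions, not the project’s actual schema.

from dataclasses import dataclass
from enum import Enum, auto

class Uncertainty(Enum):
    MEANING = auto()      # word sense of technical, cultural, or uncommon terms, or jargon
    REFERENCE = auto()    # demonstratives, adverbs, definite articles, pronouns
    CONJECTURE = auto()   # jumped-to conclusions, completed partial information, assumptions
    CREDIBILITY = auto()  # unserious, manipulative, or apparently false statements
    EVIDENCE = auto()     # claims without support: opaque training data or hallucination

@dataclass
class UncertainSpan:
    start: int            # character offset into the summary text
    end: int
    category: Uncertainty
    note: str = ""        # optional explanation from the evaluating model

# Example: flag a pronoun whose referent the summary may have mistaken.
span = UncertainSpan(start=42, end=46, category=Uncertainty.REFERENCE,
                     note="'they' may refer to the crew, not the port authority")
print(span.category.name, span.note)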
The visual conventions had to satisfy four criteria:

Experiential: Should feel uncertain at a glance.
Reflective: Should make sense upon inspection.
Legible: Should be appropriately legible at appropriate moments.
Implementable: Should utilize existing display technology to address compatibility.
Implemented Visual Conventions: Strikethrough, Transparency, Static Fill, Pattern, Text Blur, Zig-Zag, Weight
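As a rough illustration of implementability, the sketch below maps a subset of these conventions to inline CSS for an HTML rendering of a summary. The specific pairings and style values are assumptions for demonstration, not the demo’s actual styling.

# Hypothetical mapping from visual convention to inline CSS; the conventions
# that need textures (static fill, pattern) are omitted from this simple sketch.
CONVENTION_CSS = {
    "strikethrough": "text-decoration: line-through;",
    "transparency":  "opacity: 0.45;",
    "text_blur":     "filter: blur(1px);",
    "zig_zag":       "text-decoration: underline wavy;",
    "weight":        "font-weight: 300;",
}

def render_span(text: str, convention: str) -> str:
    """Wrap a span of summary text in the style for one visual convention."""
    return f'<span style="{CONVENTION_CSS[convention]}">{text}</span>'

# Example: mark a reference-uncertain phrase with the zig-zag convention.
print(render_span("those vessels", "zig_zag"))

Because each convention reduces to styling that standard displays already support, the “Implementable” criterion above can be met without new rendering technology.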