PI: Helen Armstrong
Co-PI: Dr. Matthew Peterson
Research Assistants: Rebecca Planchart & Kweku Baidoo

The NC State Laboratory for Analytic Sciences Team.

RESEARCH QUESTION. How can interactive visualizations of confidence scores enable language analysts to more effectively calibrate trust in speaker model outputs?

RESEARCH OBJECTIVE: Reveal potential innovations in visualization and interface design that will increase the likelihood of language analysts efficiently validating speaker model outputs, especially from a human-machine trust calibration perspective. 

OBJECTIVE 1: Explore and evaluate potential visualizations and UX patterns for signifying confidence and uncertainty in speaker model outputs. 

OBJECTIVE 2: Create three different visual prototypes — in this case, mockups that provide explicit visual specifications for implementation — representing three possible solutions to this problem space. These visual prototypes should be structured so that usability testing might be efficiently conducted by the IC at the conclusion of the project. 

In this five-month project, our team dug into UX/UI AI explainability challenges related to speech recognition and speaker diarization. Language analysts currently work with percentage scores that express the confidence and uncertainty of their speaker model’s output.

Analysts struggle to interpret and validate these numerical scores, for which existing software platforms provide little context. The scores and the surrounding user interface fail to provide context-sensitive layers of explainability, leading analysts to misunderstand and miscalibrate their trust in the model’s outputs.

Our team explored alternative visualization methods, situating the results within current language analyst workflows. We worked with LAS-side developers and potential users to create testable prototypes that demonstrate these alternative visualizations in use, revealing additional explainability features that might bolster existing software capabilities.

This material is based upon work done, in whole or in part, in coordination with the Department of Defense (DoD). Any opinions, findings, conclusions, or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the DoD and/or any agency or entity of the United States Government.

Resulting Prototype Scenario Videos

Square Digit Scenario Video

Arc Gauge Scenario Video

Bar Fill Scenario Video

Research Outcomes

Key Insight One: Analyst Assumptions About Confidence Scores. Most analysts believe that confidence scores consistently and logically measure the probability that the identified speaker is their target. However, analyst experience demonstrates that confidence scores do not carry equal weight.
Key Insight Two: Analyst Truth Marking Behavior. Analysts often rely on truth marking to validate speaker IDs, but analyst truth-marking behavior varies.
Key Insight Three: Delayed Feedback Frustration. Analysts grow frustrated when their truth marking is not immediately reflected in model outputs.
Clear signaling can make visual scanning far more efficient. Image of a block of gray letters with some letters highlighted in red.
Abstract concepts such as confidence lack a direct correlate in the real world. Icon: a drawing of a mouse. Index: a picture of a mousetrap. Symbol: the word “mouse.”
Depth of Engagement Framework. Chart with Shallow at one end of the spectrum and Deep at the other; Notice appears at the shallow end, with Read, Probe, and Inspect moving toward the deep end.
Depth of Engagement in Visual Elements and Interface Patterns. Chart listing the elements and patterns that fall at these different levels of engagement: notice, read, probe, and inspect.
Confidence Indicators. Three different visual systems for indicating confidence.
System 1: Square Digit. Dark blue and light blue background squares with black numerals inside, ranging from -1 to 9; a check mark indicates “10,” or human verified.
System 2: Arc Gauge. A five-tier range. 0% (unidentified speaker): an unfilled half circle. 20-45% (Fair): a slightly filled half circle. 45-80% (Good): a half-filled half circle. 80-99% (Excellent): an almost filled half circle. 100% (analyst verified): a completely filled half circle.
System 3: Bar Fill. A bar filled to different levels to indicate confidence. 0% (unidentified speaker): the bar is unfilled. 20-45% (Fair): slightly filled. 45-80% (Good): half filled. 90-99% (Excellent): almost filled. 100% (analyst verified): completely filled.
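To make the threshold bands above concrete, the sketch below maps a raw confidence percentage onto the qualitative levels used by the Arc Gauge and Bar Fill indicators. This is a minimal TypeScript illustration only; the band boundaries, the function name toConfidenceLevel, and the analystVerified flag are assumptions drawn from the descriptions above, not part of the prototype specification.

    // Qualitative levels used by the Arc Gauge and Bar Fill indicators.
    type ConfidenceLevel =
      | "Unidentified"       // 0%: no speaker match
      | "Fair"               // roughly 20-45%
      | "Good"               // roughly 45-80%
      | "Excellent"          // roughly 80-99%
      | "Analyst Verified";  // 100%: human truth-marked

    // Map a model confidence score (0-100) to an indicator level.
    // Band boundaries are assumptions based on the ranges described above.
    function toConfidenceLevel(score: number, analystVerified = false): ConfidenceLevel {
      if (analystVerified) return "Analyst Verified";
      if (score <= 0) return "Unidentified";
      if (score < 45) return "Fair";
      if (score < 80) return "Good";
      return "Excellent";
    }

    // Example: a 62% match would render as a half-filled gauge ("Good").
    console.log(toConfidenceLevel(62)); // "Good"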
Image of the visual confidence threshold setting feature: a sliding spectrum.
Image of the certainty description via click over confidence indicator feature.
Image of the Depth of Engagement framework in action.
Image of the truth marking history accessible on hover via “i” symbol feature.
Image of the feature where hovering indicates the last person to truth mark and the time to model update.
Image of the thumbs up/thumbs down feature in comparison to a three-point scale.
Image of the five alternative speaker match feature.
Image of the top speaker choice editing feature.
Impact: Reimagined confidence indicators can help analysts appropriately calibrate trust in AI, thus supporting successful human-machine teams.