Imperial College London
Abstract
Understanding model decisions is crucial in medical imaging, where
interpretability directly impacts clinical trust and adoption. Vision
Transformers (ViTs) have demonstrated state-of-the-art performance in
diagnostic imaging; however, their complex attention mechanisms pose challenges
to explainability. This study evaluates the explainability of four Vision
Transformer architectures and pre-training strategies (ViT, DeiT, DINO, and
Swin Transformer) using Gradient Attention Rollout and Grad-CAM. We conduct
both quantitative and qualitative analyses on two medical imaging tasks:
peripheral blood cell classification and breast ultrasound image
classification. Our findings indicate that DINO combined with Grad-CAM offers
the most faithful and localized explanations across datasets. Grad-CAM
consistently produces class-discriminative and spatially precise heatmaps,
while Gradient Attention Rollout yields more scattered activations. Even in
misclassification cases, DINO with Grad-CAM highlights clinically relevant
morphological features that appear to have misled the model. By improving model
transparency, this research supports the reliable and explainable integration
of ViTs into critical medical diagnostic workflows.
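The abstract names two attribution methods; as a point of reference, the sketch below shows the standard gradient-weighted attention rollout computation. The tensor shapes, head-fusion rule, and function name are illustrative assumptions, not the paper's implementation.

```python
import torch

def gradient_attention_rollout(attentions, gradients=None):
    """Fuse per-layer attention maps into one relevance map per patch token.

    attentions: list of [B, heads, T, T] attention maps, one per layer,
                where T = 1 CLS token + N patch tokens.
    gradients:  optional list of same-shaped gradients of the target class
                score; if given, each map is gradient-weighted (the variant
                referred to as Gradient Attention Rollout).
    Returns a [B, N] relevance score of each patch token for the CLS token.
    """
    B, _, T, _ = attentions[0].shape
    rollout = torch.eye(T).expand(B, T, T).clone()
    for layer, attn in enumerate(attentions):
        if gradients is not None:
            # Keep positively contributing attention, weighted by its gradient.
            attn = torch.relu(attn * gradients[layer])
        fused = attn.mean(dim=1)                         # average heads -> [B, T, T]
        fused = 0.5 * fused + 0.5 * torch.eye(T)         # identity models the residual path
        fused = fused / fused.sum(dim=-1, keepdim=True)  # re-normalize rows
        rollout = fused @ rollout                        # compose attention across layers
    return rollout[:, 0, 1:]                             # CLS-to-patch relevance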
AI Insights
- A reproducible framework ranks ViT explainability across tasks, moving beyond simple accuracy.
- DINO + Grad‑CAM yields sharp, class‑discriminative heatmaps, even on misclassifications, highlighting key morphology.
- Gradient Attention Rollout produces diffuse activations, less useful for clinical interpretation.
- Evaluation relies on accuracy and F1‑score, missing richer interpretability metrics.
- Only four ViT variants and two methods were tested; a broader survey could reveal more explainable models.
- Future work should develop ViT‑specific explainability tools that align with clinical reasoning.
- Suggested reading: “Attention is All You Need” for transformer theory and “Explainable Deep Learning for Medical Imaging” for domain insights.
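For comparison, here is a minimal sketch of how Grad-CAM is commonly adapted to a ViT backbone. The timm model name, the hooked layer (blocks[-1].norm1), and the token-to-grid reshaping are assumptions standing in for the study's exact setup, not the authors' code.

```python
# Hedged sketch: Grad-CAM on a timm ViT. Model choice, hooked layer, and
# post-processing are illustrative assumptions, not the paper's configuration.
import torch
import timm

model = timm.create_model("vit_base_patch16_224", pretrained=True).eval()

activations, gradients = {}, {}
target_layer = model.blocks[-1].norm1  # a common Grad-CAM target for ViTs

target_layer.register_forward_hook(
    lambda _m, _i, out: activations.update(tokens=out))      # [B, 1+N, D]
target_layer.register_full_backward_hook(
    lambda _m, _gi, gout: gradients.update(tokens=gout[0]))  # same shape

def grad_cam(image, class_idx=None):
    """image: [1, 3, 224, 224] tensor normalized for the backbone."""
    logits = model(image)
    if class_idx is None:
        class_idx = logits.argmax(dim=-1).item()
    model.zero_grad()
    logits[0, class_idx].backward()

    acts = activations["tokens"][:, 1:, :]    # drop the CLS token -> [1, N, D]
    grads = gradients["tokens"][:, 1:, :]
    weights = grads.mean(dim=1, keepdim=True)        # channel importance [1, 1, D]
    cam = torch.relu((weights * acts).sum(dim=-1))   # [1, N]
    side = int(cam.shape[-1] ** 0.5)                 # 14x14 patch grid for 224/16
    cam = cam.reshape(1, side, side)
    cam = (cam - cam.min()) / (cam.max() - cam.min() + 1e-8)
    return cam  # upsample to input resolution and overlay on the image
```

In practice the backbone would first be fine-tuned on the blood cell or ultrasound dataset, and the upsampled CAM overlaid on the input for qualitative inspection.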
Abstract
In an era where AI is evolving from a passive tool into an active and
adaptive companion, we introduce AI for Service (AI4Service), a new paradigm
that enables proactive and real-time assistance in daily life. Existing AI
services remain largely reactive, responding only to explicit user commands. We
argue that a truly intelligent and helpful assistant should be capable of
anticipating user needs and taking actions proactively when appropriate. To
realize this vision, we propose Alpha-Service, a unified framework that
addresses two fundamental challenges: Know When to intervene by detecting
service opportunities from egocentric video streams, and Know How to provide
both generalized and personalized services. Inspired by the von Neumann
computer architecture and built around AI glasses, Alpha-Service consists of five
key components: an Input Unit for perception, a Central Processing Unit for
task scheduling, an Arithmetic Logic Unit for tool utilization, a Memory Unit
for long-term personalization, and an Output Unit for natural human
interaction. As an initial exploration, we implement Alpha-Service through a
multi-agent system deployed on AI glasses. Case studies, including a real-time
Blackjack advisor, a museum tour guide, and a shopping fit assistant,
demonstrate its ability to seamlessly perceive the environment, infer user
intent, and provide timely and useful assistance without explicit prompts.
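To make the five-unit decomposition concrete, below is a minimal, self-contained sketch of how the units could be wired together; all class names, method signatures, and the trivial trigger rule are illustrative assumptions, not the authors' multi-agent implementation.

```python
# Illustrative sketch of the five-unit Alpha-Service decomposition described
# above. Names, signatures, and the trigger rule are assumptions, not the
# authors' multi-agent system.
from dataclasses import dataclass, field


class InputUnit:
    """Perception: turns an egocentric video frame into a scene description."""
    def perceive(self, frame) -> str:
        # Placeholder for a vision-language model call on the frame.
        return "user is looking at a museum exhibit"


class ArithmeticLogicUnit:
    """Tool utilization: calls external tools/APIs for a scheduled task."""
    def run_tool(self, task: str, context: str) -> str:
        # Placeholder for retrieval, calculators, or web APIs.
        return f"tool result for '{task}' given '{context}'"


@dataclass
class MemoryUnit:
    """Long-term personalization (here just a preference dictionary)."""
    preferences: dict = field(default_factory=dict)

    def remember(self, key: str, value) -> None:
        self.preferences[key] = value

    def recall(self, key: str, default=None):
        return self.preferences.get(key, default)


class OutputUnit:
    """Natural interaction: renders the response (speech/text on the glasses)."""
    def respond(self, message: str) -> None:
        print(f"[assistant] {message}")


class CentralProcessingUnit:
    """Task scheduling: decides when to intervene and how to serve."""
    def __init__(self, alu, memory, out):
        self.alu, self.memory, self.out = alu, memory, out

    def step(self, scene: str) -> None:
        # "Know When": a trivial keyword trigger stands in for opportunity detection.
        if "museum" in scene:
            # "Know How": combine a tool call with personalized memory.
            detail = self.alu.run_tool("explain exhibit", scene)
            style = self.memory.recall("explanation_style", "concise")
            self.out.respond(f"({style}) {detail}")


if __name__ == "__main__":
    cpu = CentralProcessingUnit(ArithmeticLogicUnit(), MemoryUnit(), OutputUnit())
    cpu.memory.remember("explanation_style", "story-like")
    cpu.step(InputUnit().perceive(frame=None))
```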