Abstract
Modern computer systems often rely on syslog, a simple, universal protocol for
recording critical events across heterogeneous infrastructure. However,
healthcare's rapidly growing clinical AI stack has no equivalent. As hospitals
rush to pilot large language models and other AI-based clinical decision
support tools, we still lack a standard way to record how, when, by whom, and
for whom these AI models are used. Without that transparency and visibility, it
is challenging to measure real-world performance and outcomes, detect adverse
events, or correct bias or dataset drift. In the spirit of syslog, we introduce
MedLog, a protocol for event-level logging of clinical AI. Any time an AI model
is invoked to interact with a human, interface with another algorithm, or act
independently, a MedLog record is created. This record consists of nine core
fields: header, model, user, target, inputs, artifacts, outputs, outcomes, and
feedback, providing a structured and consistent record of model activity. To
encourage early adoption, especially in low-resource settings, and to minimize
the data footprint, MedLog supports risk-based sampling, lifecycle-aware
retention policies, and write-behind caching; detailed traces of complex,
agentic, or multi-stage workflows can also be captured. MedLog can catalyze the
development of new databases and software to store and analyze its records.
Realizing this vision would enable continuous surveillance, auditing,
and iterative improvement of medical AI, laying the foundation for a new form
of digital epidemiology.
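
To make the record structure concrete, the sketch below shows what a single
MedLog record and a risk-based sampling decision might look like in Python. The
nine top-level field names come directly from the abstract; the nested keys,
risk tiers, sampling rates, and example values are illustrative assumptions,
not part of any published MedLog schema.

```python
"""Illustrative sketch of a MedLog-style record; not an official schema."""
import json
import random
import uuid
from datetime import datetime, timezone

# Hypothetical sampling rates keyed by risk tier. The abstract calls for
# risk-based sampling but does not define tiers or rates; these are assumptions.
SAMPLE_RATES = {"high": 1.0, "medium": 0.25, "low": 0.05}


def should_log(risk_tier: str) -> bool:
    """Risk-based sampling: always log high-risk invocations, sample the rest."""
    return random.random() < SAMPLE_RATES.get(risk_tier, 1.0)


def build_record(model_id: str, user_id: str, patient_id: str,
                 prompt: str, response: str) -> dict:
    """Assemble the nine core MedLog fields named in the abstract.

    The nested keys and values are illustrative only.
    """
    return {
        "header": {  # when and where the invocation occurred
            "record_id": str(uuid.uuid4()),
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "site": "example-hospital",
        },
        "model": {"id": model_id, "version": "2025-01"},  # which model was invoked
        "user": {"id": user_id, "role": "clinician"},     # who or what invoked it
        "target": {"patient_id": patient_id},             # for whom the output applies
        "inputs": {"prompt": prompt},                     # data supplied to the model
        "artifacts": {"retrieved_context": []},           # intermediate items, e.g. retrieved notes
        "outputs": {"response": response},                # what the model returned
        "outcomes": None,   # downstream clinical outcome, typically added later
        "feedback": None,   # user rating or correction, typically added later
    }


if __name__ == "__main__":
    if should_log(risk_tier="medium"):
        record = build_record(
            model_id="llm-cds-demo",
            user_id="u-123",
            patient_id="p-456",
            prompt="Summarize today's oncology note.",
            response="Patient tolerating therapy; monitor liver function.",
        )
        print(json.dumps(record, indent=2))
```

The outcomes and feedback fields are left empty here because, in most
deployments, one would expect them to become available only after the encounter
and to be appended to the record asynchronously.
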
Mass General Brigham, MIT
Abstract
Large language models (LLMs) integrated into agent-driven workflows hold
immense promise for healthcare, yet a significant gap exists between their
potential and practical implementation within clinical settings. To address
this, we present a practitioner-oriented field manual for deploying generative
agents that use electronic health record (EHR) data. This guide is informed by
our experience deploying the "irAE-Agent", an automated system to detect
immune-related adverse events from clinical notes at Mass General Brigham, and
by structured interviews with 20 clinicians, engineers, and informatics leaders
involved in the project. Our analysis reveals a critical misalignment in
clinical AI development: less than 20% of our effort was dedicated to prompt
engineering and model development, while over 80% was consumed by the
sociotechnical work of implementation. We distill this effort into five "heavy
lifts": data integration, model validation, ensuring economic value, managing
system drift, and governance. By providing actionable solutions for each of
these challenges, this field manual shifts the focus from algorithmic
development to the essential infrastructure and implementation work required to
bridge the "valley of death" and successfully translate generative AI from
pilot projects into routine clinical care.