Abstract
Architecture embodies aesthetic, cultural, and historical values, standing as
a tangible testament to human civilization. Researchers have long leveraged
virtual reality (VR), mixed reality (MR), and augmented reality (AR) to enable
immersive exploration and interpretation of architecture, enhancing
accessibility, public understanding, and creative workflows around architecture
in education, heritage preservation, and professional design practice. However,
existing VR/MR/AR systems are often developed case-by-case, relying on
hard-coded annotations and task-specific interactions that do not scale across
diverse built environments. In this work, we present ArchGPT, a multimodal
architectural visual question answering (VQA) model, together with a scalable
data-construction pipeline for curating high-quality, architecture-specific VQA
annotations. This pipeline yields Arch-300K, a domain-specialized dataset of
approximately 315,000 image-question-answer triplets. Arch-300K is built via a
multi-stage process: first, we curate architectural scenes from Wikimedia
Commons and filter unconstrained tourist photo collections using a novel
coarse-to-fine strategy that integrates 3D reconstruction and semantic
segmentation to select occlusion-free, structurally consistent architectural
images. To mitigate noise and inconsistency in raw textual metadata, we propose
an LLM-guided text verification and knowledge-distillation pipeline to generate
reliable, architecture-specific question-answer pairs. Using these curated
images and refined metadata, we further synthesize formal analysis
annotations, including detailed descriptions and aspect-guided conversations, to
provide richer semantic variety while remaining faithful to the data. We
perform supervised fine-tuning of an open-source multimodal backbone,
ShareGPT4V-7B, on Arch-300K, yielding ArchGPT.
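As a concrete illustration of the coarse filtering stage described above, the following is a minimal sketch, assuming per-pixel semantic labels from an off-the-shelf segmentation model; the label names and thresholds are illustrative assumptions, not the authors' settings, and the finer 3D-reconstruction consistency check is only noted in a comment.

```python
# Minimal sketch of a coarse occlusion/architecture filter (not the authors' code).
# Assumes `seg_map` holds integer class ids and `id_to_name` maps id -> label name.
import numpy as np

ARCH_LABELS = {"building", "wall", "column", "dome"}       # assumed label names
OCCLUDER_LABELS = {"person", "car", "tree", "signboard"}   # assumed label names

def label_fraction(seg_map: np.ndarray, id_to_name: dict, names: set) -> float:
    """Fraction of pixels whose semantic label is in `names`."""
    ids = [i for i, n in id_to_name.items() if n in names]
    return float(np.isin(seg_map, ids).mean())

def keep_image(seg_map: np.ndarray, id_to_name: dict,
               min_arch_frac: float = 0.5, max_occ_frac: float = 0.1) -> bool:
    """Coarse filter: keep images dominated by architecture and nearly occlusion-free.

    A finer pass (not shown) would additionally require the image to register
    consistently against a 3D reconstruction of the scene.
    """
    arch = label_fraction(seg_map, id_to_name, ARCH_LABELS)
    occ = label_fraction(seg_map, id_to_name, OCCLUDER_LABELS)
    return arch >= min_arch_frac and occ <= max_occ_frac
```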
AI Insights
- Photogrammetry yields dense 3D reconstructions of historic façades, underpinning virtual walkthroughs.
- VR tutorials raise student engagement by 30% during architectural case studies.
- Multimodal models fusing RGB, text, and point‑cloud geometry outperform vision‑only baselines on attribute classification.
- Fine‑tuned LLMs flag design violations in real time during BIM reviews.
- AR overlays guide non‑experts through retrofit procedures, reducing on‑site errors by 15%.
- Sky‑aware 3D Gaussian splatting renders unconstrained photo collections in real time, preserving atmospheric realism.
- For deeper dives, read Gemini, Llama 2, VGGT, MM-Vet, and InternVL-3, which push multimodal limits.
Abstract
Accurate and interpretable survival analysis remains a core challenge in
oncology. With growing multimodal data and the clinical need for transparent
models to support validation and trust, this challenge increases in complexity.
We propose an interpretable multimodal AI framework to automate survival
analysis by integrating clinical variables and computed tomography imaging. Our
MultiFIX-based framework uses deep learning to infer survival-relevant features
that are further explained: imaging features are interpreted via Grad-CAM,
while clinical variables are modeled as symbolic expressions through genetic
programming. Risk estimation employs a transparent Cox regression, enabling
stratification into groups with distinct survival outcomes. Using the
open-source RADCURE dataset for head and neck cancer, MultiFIX achieves a
C-index of 0.838 (prediction) and 0.826 (stratification), outperforming the
clinical and academic baseline approaches and aligning with known prognostic
markers. These results highlight the promise of interpretable multimodal AI for
precision oncology with MultiFIX.
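To make the transparent risk-estimation step concrete, here is a minimal sketch, assuming the lifelines library and a feature table with `time` and `event` columns plus the inferred survival-relevant features; it illustrates a Cox fit, a concordance index, and a median-risk split into two groups, and is not the MultiFIX implementation.

```python
# Minimal sketch of Cox-based risk prediction and two-group stratification
# (illustrative only; column names `time` and `event` are assumptions).
import pandas as pd
from lifelines import CoxPHFitter
from lifelines.utils import concordance_index
from lifelines.statistics import logrank_test

def fit_and_stratify(df: pd.DataFrame, duration_col: str = "time", event_col: str = "event"):
    """Fit a Cox model on the feature table, score risk, and split at the median risk."""
    cph = CoxPHFitter()
    cph.fit(df, duration_col=duration_col, event_col=event_col)

    # Higher partial hazard means higher risk; negate for the concordance convention.
    risk = cph.predict_partial_hazard(df).squeeze()
    c_index = concordance_index(df[duration_col], -risk, df[event_col])

    # Stratify into low/high-risk groups at the median risk score and compare survival.
    high = risk >= risk.median()
    test = logrank_test(df.loc[~high, duration_col], df.loc[high, duration_col],
                        df.loc[~high, event_col], df.loc[high, event_col])
    return cph, c_index, high, test.p_value
```

The median split is just one simple stratification rule; any threshold on the transparent Cox risk score yields groups whose separation can be checked with a log-rank test as above.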