University of Michigan
Abstract
Unstructured data, such as text, images, audio, and video, comprises the vast
majority of the world's information, yet it remains poorly supported by
traditional data systems that rely on structured formats for computation. We
argue for a new paradigm, which we call computing on unstructured data, built
around three stages: extraction of latent structure, transformation of this
structure through data processing techniques, and projection back into
unstructured formats. This bi-directional pipeline allows unstructured data to
benefit from the analytical power of structured computation, while preserving
the richness and accessibility of unstructured representations for human and AI
consumption. We illustrate this paradigm through two use cases and present the
research components that need to be developed in a new data system called
MXFlow.
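The three stages named above (extraction, transformation, projection) can be sketched in a few lines. This is only a toy illustration under stated assumptions, not MXFlow's design or API: the sample notes, the regular expression standing in for a learned extractor, and the function names `extract`, `transform`, and `project` are all hypothetical.

```python
import re
from collections import defaultdict

# Hypothetical free-text notes standing in for unstructured input.
NOTES = [
    "Alice paid 30 dollars for lunch.",
    "Bob paid 12 dollars for coffee.",
    "Alice paid 8 dollars for parking.",
]

# A regular expression stands in for a real extractor (assumption).
PATTERN = re.compile(
    r"(?P<who>\w+) paid (?P<amount>\d+) dollars for (?P<what>\w+)\."
)


def extract(notes):
    """Stage 1: pull latent structure (rows) out of free text."""
    rows = []
    for note in notes:
        m = PATTERN.fullmatch(note)
        if m:
            rows.append(
                {"who": m["who"], "amount": int(m["amount"]), "what": m["what"]}
            )
    return rows


def transform(rows):
    """Stage 2: ordinary structured computation (a group-by sum)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row["who"]] += row["amount"]
    return dict(totals)


def project(totals):
    """Stage 3: render the structured result back into prose."""
    parts = [
        f"{who} spent {amount} dollars in total"
        for who, amount in sorted(totals.items())
    ]
    return "; ".join(parts) + "."


summary = project(transform(extract(NOTES)))
print(summary)
```

The point of the sketch is the shape of the pipeline: the middle stage is plain structured computation, while the first and last stages convert to and from an unstructured representation.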
AI Insights
- MXFlow’s dynamic dataflow engine orchestrates neural and symbolic operators for seamless cross‑modal transformations.
- Built‑in cost model predicts query time, guiding optimal operator placement across text, image, and table streams.
- Unlike ETL, MXFlow supports full read‑write pipelines, enabling in‑place updates to extracted structures before projection.
- Treating LLMs as first‑class storage, MXFlow merges declarative SQL semantics with generative reasoning over unstructured inputs.
- Multimodal output layer can generate structured tables, annotated images, and natural‑language summaries simultaneously.
- See Anderson et al.’s “LLM‑powered unstructured analytics system” paper for practical implementation insights.
Abstract
The increasing volume and complexity of X-ray absorption spectroscopy (XAS)
data generated at synchrotron facilities worldwide require robust
infrastructure for data management, sharing, and analysis. This paper
introduces the XAS Database (XASDB), a comprehensive web-based platform
developed and hosted by the Canadian Light Source (CLS). The database houses
more than 1000 reference spectra spanning 40 elements and 324 chemical
compounds. The platform employs a Node.js/MongoDB architecture designed to
handle diverse data formats from multiple beamlines and synchrotron facilities.
A key innovation is the XASproc JavaScript library, which enables browser-based
XAS data processing including normalization, background subtraction, extended
X-ray absorption fine structure (EXAFS) extraction, and preliminary analysis
traditionally limited to desktop applications. The integrated XASVue spectral
viewer provides installation-free data visualization and analysis with broad
accessibility across devices and operating systems. By offering standardized
data output, comprehensive metadata, and integrated analytical capabilities,
XASDB facilitates collaborative research and promotes FAIR (Findable,
Accessible, Interoperable, and Reusable) data principles. The platform serves
as a valuable resource for linear combination fitting (LCF) analysis, machine
learning applications, and educational purposes. This initiative demonstrates
the potential for web-centric approaches in XAS data analysis, accelerating
advances in materials science, environmental research, chemistry, and biology.
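To make the linear combination fitting (LCF) use case concrete, the sketch below fits a sample spectrum as a mixture of two reference spectra by least squares. It is a minimal illustration under stated assumptions, not XASDB's or XASproc's implementation: the function name `fit_two_component` is hypothetical, the spectra are made-up numbers, and all three spectra are assumed to be already normalized and sampled on the same energy grid.

```python
def fit_two_component(sample, ref_a, ref_b):
    """Return the fraction of ref_a in a two-reference least-squares fit.

    Models sample ≈ x * ref_a + (1 - x) * ref_b on a shared energy grid
    and clamps x to the physically meaningful range [0, 1].
    """
    if not (len(sample) == len(ref_a) == len(ref_b)):
        raise ValueError("all spectra must share the same energy grid")
    diff = [a - b for a, b in zip(ref_a, ref_b)]
    denom = sum(d * d for d in diff)
    if denom == 0:
        raise ValueError("reference spectra are identical")
    # Closed-form minimizer of sum((sample - ref_b - x * diff) ** 2).
    numer = sum((s - b) * d for s, b, d in zip(sample, ref_b, diff))
    return min(1.0, max(0.0, numer / denom))


# Synthetic normalized spectra (made-up numbers, not XASDB data).
ref_a = [0.0, 0.2, 1.1, 1.0, 1.0]
ref_b = [0.0, 0.1, 0.6, 1.2, 1.0]
sample = [0.3 * a + 0.7 * b for a, b in zip(ref_a, ref_b)]

fraction_a = fit_two_component(sample, ref_a, ref_b)
print(f"ref_a: {fraction_a:.2f}, ref_b: {1 - fraction_a:.2f}")
```

Fits with more than two references are usually posed as a constrained least-squares problem over all component weights; the two-reference case is shown here because it has a closed-form solution.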