Papers from 15 to 19 September, 2025

Here are the personalized paper recommendations sorted by most relevant
Relational Databases
👍 👎 ♥ Save
Abstract
Static analyzers are complex pieces of software with large dependencies. They can be difficult to install, which hinders adoption and creates barriers for students learning static analysis. This work introduces Try-Mopsa: a scaled-down version of the Mopsa static analysis platform, compiled into JavaScript to run purely as a client-side application in web browsers. Try-Mopsa provides a responsive interface that works on both desktop and mobile devices. Try-Mopsa features all the core components of Mopsa. In particular, it supports relational numerical domains. We present the interface, changes and adaptations required to have a pure JavaScript version of Mopsa. We envision Try-Mopsa as a convenient platform for onboarding or teaching purposes.
👍 👎 ♥ Save
Abstract
In this technical report, we present a formalisation of the MongoDB aggregation framework. Our aim is to identify a fragment that could serve as the starting point for an industry-wide standard for querying JSON document databases. We provide a syntax and formal semantics for a set of selected operators, We show how this fragment relates to known relational query languages. We explain how our semantics differs from the current implementation of MongoDB, and justify our choices. We provide a set of algebraic transformations that can be used for query optimisation.
Data Warehousing
👍 👎 ♥ Save
University of Michigan
Paper visualization
Rate this image: 😍 👍 👎
Abstract
Unstructured data, such as text, images, audio, and video, comprises the vast majority of the world's information, yet it remains poorly supported by traditional data systems that rely on structured formats for computation. We argue for a new paradigm, which we call computing on unstructured data, built around three stages: extraction of latent structure, transformation of this structure through data processing techniques, and projection back into unstructured formats. This bi-directional pipeline allows unstructured data to benefit from the analytical power of structured computation, while preserving the richness and accessibility of unstructured representations for human and AI consumption. We illustrate this paradigm through two use cases and present the research components that need to be developed in a new data system called MXFlow.
AI Insights
  • MXFlow’s dynamic dataflow engine orchestrates neural and symbolic operators for seamless cross‑modal transformations.
  • Built‑in cost model predicts query time, guiding optimal operator placement across text, image, and table streams.
  • Unlike ETL, MXFlow supports full read‑write pipelines, enabling in‑place updates to extracted structures before projection.
  • Treating LLMs as first‑class storage, MXFlow merges declarative SQL semantics with generative reasoning over unstructured inputs.
  • Multimodal output layer can generate structured tables, annotated images, and natural‑language summaries simultaneously.
  • See Anderson et al.’s “LLM‑powered unstructured analytics system” paper for practical implementation insights.
👍 👎 ♥ Save
Abstract
The increasing volume and complexity of X-ray absorption spectroscopy (XAS) data generated at synchrotron facilities worldwide require robust infrastructure for data management, sharing, and analysis. This paper introduces the XAS Database (XASDB), a comprehensive web-based platform developed and hosted by the Canadian Light Source (CLS). The database houses more than 1000 reference spectra spanning 40 elements and 324 chemical compounds. The platform employs a Node.js/MongoDB architecture designed to handle diverse data formats from multiple beamlines and synchrotron facilities. A key innovation is the XASproc JavaScript library, which enables browser-based XAS data processing including normalization, background sub- traction, extended X-ray absorption fine structure (EXAFS) extraction, and preliminary analysis traditionally limited to desktop applications. The integrated XASVue spectral viewer provides installation-free data visualization and analysis with broad accessibility across devices and operating systems. By offering standardized data output, comprehensive metadata, and integrated analytical ca- pabilities, XASDB facilitates collaborative research and promotes FAIR (Findable, Accessible, In- teroperable, and Reusable) data principles. The platform serves as a valuable resource for linear combination fitting (LCF) analysis, machine learning applications, and educational purposes. This initiative demonstrates the potential for web-centric approaches in XAS data analysis, accelerating advances in materials science, environmental research, chemistry, and biology.
SQL
👍 👎 ♥ Save
The Hong Kong University
Abstract
Natural Language to SQL (NL2SQL) provides a new model-centric paradigm that simplifies database access for non-technical users by converting natural language queries into SQL commands. Recent advancements, particularly those integrating Retrieval-Augmented Generation (RAG) and Chain-of-Thought (CoT) reasoning, have made significant strides in enhancing NL2SQL performance. However, challenges such as inaccurate task decomposition and keyword extraction by LLMs remain major bottlenecks, often leading to errors in SQL generation. While existing datasets aim to mitigate these issues by fine-tuning models, they struggle with over-fragmentation of tasks and lack of domain-specific keyword annotations, limiting their effectiveness. To address these limitations, we present DeKeyNLU, a novel dataset which contains 1,500 meticulously annotated QA pairs aimed at refining task decomposition and enhancing keyword extraction precision for the RAG pipeline. Fine-tuned with DeKeyNLU, we propose DeKeySQL, a RAG-based NL2SQL pipeline that employs three distinct modules for user question understanding, entity retrieval, and generation to improve SQL generation accuracy. We benchmarked multiple model configurations within DeKeySQL RAG pipeline. Experimental results demonstrate that fine-tuning with DeKeyNLU significantly improves SQL generation accuracy on both BIRD (62.31% to 69.10%) and Spider (84.2% to 88.7%) dev datasets.
👍 👎 ♥ Save
Abstract
In this study, we optimize SQL+ML queries on top of OpenMLDB, an open-source database that seamlessly integrates offline and online feature computations. The work used feature-rich synthetic dataset experiments in Docker, which acted like production environments that processed 100 to 500 records per batch and 6 to 12 requests per batch in parallel. Efforts have been concentrated in the areas of better query plans, cached execution plans, parallel processing, and resource management. The experimental results show that OpenMLDB can support approximately 12,500 QPS with less than 1 ms latency, outperforming SparkSQL and ClickHouse by a factor of 23 and PostgreSQL and MySQL by 3.57 times. This study assessed the impact of optimization and showed that query plan optimization accounted for 35% of the performance gains, caching for 25%, and parallel processing for 20%. These results illustrate OpenMLDB's capability for time-sensitive ML use cases, such as fraud detection, personalized recommendation, and time series forecasting. The system's modular optimization framework, which combines batch and stream processing without interference, contributes to its significant performance gain over traditional database systems, particularly in applications that require real-time feature computation and serving. This study contributes to the understanding and design of high-performance SQL+ML systems and highlights the need for specialized SQL optimization for ML workloads.

Interests not found

We did not find any papers that match the below interests. Try other terms also consider if the content exists in arxiv.org.
  • NoSQL Databases
  • Database Design
You can edit or add more interests any time.

Unsubscribe from these updates