Papers from 08 to 12 September 2025

Here are your personalized paper recommendations, sorted by relevance.
Relational Databases
Abstract
The database community lacks a unified relational query language for subset selection and optimisation queries, limiting both user expression and query optimiser reasoning about such problems. Decades of research (latterly under the rubric of prescriptive analytics) have produced powerful evaluation algorithms with incompatible, ad-hoc SQL extensions that specify and filter through distinct mechanisms. We present the first unified algebraic foundation for these queries, introducing relational exponentiation to complete the fundamental algebraic operations alongside union (addition) and cross product (multiplication). First, we extend relational algebra to complete domain relations (relations defined by characteristic functions rather than explicit extensions), achieving the expressiveness of NP-complete/hard problems, while simultaneously providing query safety for finite inputs. Second, we introduce solution sets, a higher-order relational algebra over sets of relations that naturally expresses search spaces as functions f: Base → Decision, yielding |Decision|^|Base| candidate relations. Third, we provide structure-preserving translation semantics from solution sets to standard relational algebra, enabling mechanical translation to existing evaluation algorithms. This framework achieves the expressiveness of the most powerful prior approaches while providing the theoretical clarity and compositional properties absent in previous work. We demonstrate the capabilities these algebras open up through a polymorphic SQL where standard clauses seamlessly express data management, subset selection, and optimisation queries within a single paradigm.
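As a quick illustration of the solution-set idea (a toy Python sketch of ours, not the paper's algebra or its polymorphic SQL): a search space is the set of all functions f: Base → Decision, so a binary Decision domain yields exactly 2^|Base| candidate relations for a subset-selection query, which an optimisation query then filters and ranks.

    # Toy sketch (not the paper's algebra): a solution set as all functions
    # f: Base -> Decision; with Decision = {0, 1} this is the 2^|Base|
    # candidate relations of a subset-selection query.
    from itertools import product

    base = ["item_a", "item_b", "item_c"]   # toy Base relation
    decision = [0, 1]                        # Decision domain: exclude / include

    candidates = [dict(zip(base, choice))
                  for choice in product(decision, repeat=len(base))]
    assert len(candidates) == len(decision) ** len(base)   # |Decision|^|Base| = 8

    # An optimisation query filters and ranks candidates, e.g. include as many
    # items as possible subject to a capacity of at most two.
    feasible = [c for c in candidates if sum(c.values()) <= 2]
    best = max(feasible, key=lambda c: sum(c.values()))
    print(best)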
Northeastern University
Abstract
In recent years, there has been significant progress in the development of deep learning models over relational databases, including architectures based on heterogeneous graph neural networks (hetero-GNNs) and heterogeneous graph transformers. In effect, such architectures state how the database records and links (e.g., foreign-key references) translate into a large, complex numerical expression, involving numerous learnable parameters. This complexity makes it hard to explain, in human-understandable terms, how a model uses the available data to arrive at a given prediction. We present a novel framework for explaining machine-learning models over relational databases, where explanations are view definitions that highlight focused parts of the database that contribute most to the model's prediction. We establish such global abductive explanations by adapting the classic notion of determinacy by Nash, Segoufin, and Vianu (2010). In addition to tuning the tradeoff between determinacy and conciseness, the framework allows controlling the level of granularity by adopting different fragments of view definitions, such as ones highlighting whole columns, foreign keys between tables, relevant groups of tuples, and so on. We investigate the realization of the framework in the case of hetero-GNNs. We develop heuristic algorithms that avoid the exhaustive search over the space of all databases. We propose techniques that are model-agnostic, and others that are tailored to hetero-GNNs via the notion of learnable masking. Our approach is evaluated through an extensive empirical study on the RelBench collection, covering a variety of domains and different record-level tasks. The results demonstrate the usefulness of the proposed explanations, as well as the efficiency of their generation.
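For a rough feel of the learnable-masking route mentioned in the abstract (a generic sketch under our own assumptions, with a stand-in model instead of a trained hetero-GNN), one can attach a sigmoid gate to each edge and optimise the gates to preserve the model's prediction while pushing towards sparsity; the surviving edges point at the focused part of the data that serves as an explanation.

    # Generic learnable-masking sketch; `predict` is a stand-in for a frozen,
    # pre-trained relational model, not the paper's hetero-GNN realisation.
    import torch

    num_edges = 200
    mask_logits = torch.nn.Parameter(torch.zeros(num_edges))   # one gate per edge
    opt = torch.optim.Adam([mask_logits], lr=0.05)

    def predict(edge_weights):
        # Placeholder model: in practice, the GNN prediction given masked edges.
        return (edge_weights.sum() / num_edges).unsqueeze(0)

    target = predict(torch.ones(num_edges)).detach()   # prediction to be explained

    for _ in range(300):
        gates = torch.sigmoid(mask_logits)                  # soft edge selection
        fidelity = (predict(gates) - target).pow(2).mean()  # keep the prediction
        sparsity = gates.mean()                             # prefer few edges
        loss = fidelity + 0.1 * sparsity
        opt.zero_grad(); loss.backward(); opt.step()

    kept_edges = (torch.sigmoid(mask_logits) > 0.5).nonzero().squeeze(-1)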
AI Insights
  • GNNExplainer (Ying et al. 2019) is cited as a seminal method for extracting subgraph importance in GNNs.
  • XGNN (Yuan et al. 2020) is referenced for its graph‑synthesis approach to model‑level explanations.
  • ContextGNN (Yuan et al. 2025) is noted for extending explainability to two‑tower recommendation architectures.
  • The authors recommend “Probabilistic Databases” by Suciu et al. (2011) for a rigorous treatment of uncertainty in relational settings.
  • The paper points out that current GNN explainers are limited in scope, calling for more comprehensive methods.
  • The literature review underscores the importance of feature importance, saliency maps, and model‑level explanations as core GNN interpretability techniques.
Data Warehousing
Zhejiang International
Abstract
In this short paper, we define the investment ability of data investors in the data economy and its heterogeneity. We further construct an analytical heterogeneous agent model to demonstrate that differences in data investment ability lead to divergent economic outcomes for data investors. The analytical results prove that investors with higher data investment ability can obtain greater utility through data investment, and thus have stronger incentives to invest in a larger scale of data, achieving higher productivity and technological progress and experiencing lower financial frictions. We aim to propose a prerequisite theory that extends the analytical framework of the data economy from the currently prevalent representative agent model to a heterogeneous agent model.
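To see the mechanism in toy form (our own illustrative comparative statics, not the paper's model), let investor i's ability θ_i scale the return to a chosen data scale d, with linear cost κ and curvature 0 < α < 1:

    % Illustrative only; not the paper's heterogeneous agent model.
    U_i(d) = \theta_i d^{\alpha} - \kappa d, \qquad
    \frac{\partial U_i}{\partial d} = \alpha \theta_i d^{\alpha - 1} - \kappa = 0
    \;\Rightarrow\;
    d_i^{*} = \left( \frac{\alpha \theta_i}{\kappa} \right)^{\frac{1}{1 - \alpha}}.

Both the chosen data scale d_i^* and, by the envelope theorem (∂U_i/∂θ_i = (d_i^*)^α > 0), the attained utility are increasing in θ_i, which is the qualitative pattern the abstract describes.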
AI Insights
  • The paper shows how data‑investment heterogeneity fuels a winner‑takes‑most dynamic in the data economy.
  • It highlights that firms with superior data‑processing skills capture disproportionate market share, deepening income inequality.
  • The authors propose targeted regulations—data‑collection caps and transparency mandates—to curb wealth concentration.
  • Data Economy – a system where data is a tradable, revenue‑generating asset.
  • Big Data – voluminous, complex datasets that challenge conventional analytics.
  • For deeper insight, read Corrado et al. “Data, Intangible Capital, and Productivity” (2024) on data’s macro‑productivity impact.
  • Farboodi & Veldkamp’s 2021 model offers a benchmark for comparing heterogeneous‑agent outcomes in data markets.
National Center for Compu
Abstract
Memory-to-memory data streaming is essential for modern scientific workflows that require near real-time data analysis, experimental steering, and informed decision-making during experiment execution. It eliminates the latency bottlenecks associated with file-based transfers to parallel storage, enabling rapid data movement between experimental facilities and HPC systems. These tightly coupled experimental-HPC workflows demand low latency, high throughput, and reliable data delivery to support on-the-fly analysis and timely feedback for experimental control. Off-the-shelf messaging frameworks are increasingly considered viable solutions for enabling such direct memory streaming due to their maturity, broad adoption, and ability to abstract core messaging and reliability functionalities from the application layer. However, effectively meeting the workflows' requirements depends on utilizing the framework's capabilities and carefully tuning its configurations. In this paper, we present a study that investigates the messaging parameters, and their configuration choices that impact the streaming requirements of two representative scientific workflows. We specifically characterize throughput trade-offs associated with reliable message transmission for these workflows. Our study is conducted through streaming simulations using synthetic workloads derived from the Deleria and LCLS workflows, employing the RabbitMQ messaging framework within the context of the Data Streaming to HPC infrastructure at OLCF. Our simulations reveal several key observations and practical insights that help users understand which configurations best meet the needs of their streaming workloads.
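To show where such configuration choices live in practice, here is a minimal pika sketch of one reliability knob of the kind the study characterises (publisher confirms plus persistent delivery versus fire-and-forget); it assumes a local RabbitMQ broker and a queue named "stream", and is not the authors' simulation harness.

    # Minimal pika sketch: one reliability/throughput knob (publisher confirms
    # and persistent messages vs. fire-and-forget). Assumes a local broker and
    # a queue named "stream"; illustrative only, not the paper's harness.
    import pika

    conn = pika.BlockingConnection(pika.ConnectionParameters(host="localhost"))
    ch = conn.channel()
    ch.queue_declare(queue="stream", durable=True)

    RELIABLE = True
    if RELIABLE:
        ch.confirm_delivery()      # broker acknowledges each publish (costs latency)

    payload = b"x" * 65536         # synthetic 64 KiB chunk
    props = pika.BasicProperties(delivery_mode=2) if RELIABLE else None  # persist

    for _ in range(1000):
        ch.basic_publish(exchange="", routing_key="stream",
                         body=payload, properties=props)
    conn.close()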
AI Insights
  • The study pioneers AI‑coupled HPC workflows that fuse machine‑learning inference with real‑time data streams, slashing analysis latency by an order of magnitude.
  • It proposes a unified, API‑driven research infrastructure that stitches together experimental instruments, data‑caching layers, and HPC back‑ends into a single, secure pipeline.
  • The authors demonstrate how deep‑learning pipelines can be embedded directly into the streaming fabric, enabling on‑the‑fly feature extraction for materials science and neutron crystallography.
  • A key insight is that secure, role‑based API gateways not only protect sensitive experimental data but also reduce overhead by eliminating redundant authentication hops.
  • The paper highlights that achieving seamless transitions from lab to production HPC requires coordinated tooling, not just high‑bandwidth links, and recommends a modular plug‑in architecture.
  • The authors caution that the proposed framework demands significant infrastructure investment and skilled personnel, underscoring the need for community‑wide tooling standards.
  • Finally, the study calls for collaborative R&D to refine the AI‑coupled workflow model, suggesting that shared benchmarks could accelerate adoption across scientific domains.
SQL
Inria, SNU · Sungbok Shin
Abstract
We propose leveraging Large Language Models (LLMs) as an interaction layer for medical visualization systems. In domains like healthcare, where users must navigate high-dimensional, coded, and heterogeneous datasets, LLM-generated queries enable expert medical users to express complex analytical intents in natural language. These intents are then translated into editable and executable queries, replacing the dynamic query interfaces used by traditional visualization systems built around sliders, check boxes, and drop-downs. This interaction model reduces visual clutter and eliminates the need for users to memorize field names or system codes, supporting fluid exploration, with the drawback of not exposing all the filtering criteria. We also reintroduce dynamic queries on demand to better support interactive exploration. We posit that medical users are trained to know the possible filtering options but challenged to remember the details of the attribute names and code values. We demonstrate this paradigm in ParcoursVis, our scalable EventFlow-inspired patient care pathway visualization system powered by the French National Health Data System, one of the largest health data repositories in the world.
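As a rough sketch of this interaction pattern (entirely illustrative: the LLM call is stubbed out, the field names are invented, and the real system works over SNDS codings rather than a toy table), the key point is that the model emits an editable, machine-checkable filter specification instead of driving the interface directly.

    # Illustrative pattern only: stubbed LLM call, invented field names.
    import json
    import pandas as pd

    def llm_to_filter(intent: str) -> dict:
        # Stand-in for an LLM call mapping natural language to a structured,
        # user-editable filter spec (canned response shown here).
        return {"age_min": 65, "sex": "F", "drug_code": "N02BE01"}

    def apply_filter(df: pd.DataFrame, spec: dict) -> pd.DataFrame:
        out = df[df["age"] >= spec["age_min"]]
        out = out[out["sex"] == spec["sex"]]
        return out[out["drug_code"] == spec["drug_code"]]

    patients = pd.DataFrame({
        "age": [70, 45, 81],
        "sex": ["F", "F", "M"],
        "drug_code": ["N02BE01", "N02BE01", "N02BE01"],
    })
    spec = llm_to_filter("women over 65 dispensed paracetamol")
    print(json.dumps(spec, indent=2))      # the user can inspect and edit this spec
    print(apply_filter(patients, spec))    # before it is executed against the data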
Tencent Inc., Peking University
Abstract
SQL queries in real-world analytical environments, whether written by humans or generated automatically, often suffer from syntax errors, inefficiency, or semantic misalignment, especially in complex OLAP scenarios. To address these challenges, we propose SQLGovernor, an LLM-powered SQL toolkit that unifies multiple functionalities, including syntax correction, query rewriting, query modification, and consistency verification, within a structured framework enhanced by knowledge management. SQLGovernor introduces a fragment-wise processing strategy to enable fine-grained rewriting and localized error correction, significantly reducing the cognitive load on the LLM. It further incorporates a hybrid self-learning mechanism guided by expert feedback, allowing the system to continuously improve through DBMS output analysis and rule validation. Experiments on benchmarks such as BIRD and BIRD-CRITIC, as well as industrial datasets, show that SQLGovernor consistently boosts the performance of base models by up to 10%, while minimizing reliance on manual expertise. Deployed in production environments, SQLGovernor demonstrates strong practical utility and effective performance.
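To make the fragment-wise idea concrete (a toy simplification of ours, not SQLGovernor's actual pipeline): split the query into clause-level fragments, correct each fragment locally, and reassemble, so whatever does the fixing only ever reasons over one small span at a time.

    # Toy fragment-wise processing: split into clause fragments, fix each one
    # locally, reassemble. A simplification, not SQLGovernor's pipeline.
    import re

    CLAUSES = r"(SELECT|FROM|WHERE|GROUP BY|HAVING|ORDER BY|LIMIT)"

    def split_fragments(sql: str) -> list:
        parts = re.split(CLAUSES, sql, flags=re.IGNORECASE)
        head = parts[0].strip()
        frags = [f"{kw} {body.strip()}" for kw, body in zip(parts[1::2], parts[2::2])]
        return ([head] if head else []) + frags

    def fix_fragment(frag: str) -> str:
        frag = re.sub(r"\s+", " ", frag)                      # normalise whitespace
        return frag.replace("=<", "<=").replace("=>", ">=")   # common operator typos

    query = "SELECT  name, SUM(amt) FROM sales WHERE amt =>100 GROUP BY  name"
    fixed = " ".join(fix_fragment(f) for f in split_fragments(query))
    print(fixed)  # SELECT name, SUM(amt) FROM sales WHERE amt >=100 GROUP BY name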
AI Insights
  • SQLGovernor uses fragment‑wise processing to localize rewrites, lightening the LLM’s reasoning load.
  • A hybrid self‑learning loop, guided by expert feedback and DBMS logs, refines rewrite rules continuously.
  • Its rule‑validation engine cross‑checks queries against a curated knowledge base, catching semantic drift early.
  • By unifying syntax correction, rewriting, modification, and consistency checks, it replaces multiple separate tools.
  • Continuous learning from real‑world outcomes lets the model improve without manual tuning.
  • Production deployments show high reliability while cutting manual debugging time.
  • The knowledge‑management layer stores best‑practice patterns, enabling rapid adaptation to new schemas.

Interests not found

We did not find any papers that match the interests below. Try other terms, and also consider whether the content exists on arxiv.org.
  • NoSQL Databases
  • Database Design
You can edit or add more interests any time.

Unsubscribe from these updates