Relational Databases

Relational Database Distillation: From Structured Tables to Condensed Graph Data

The University of Queens

Abstract
Relational databases (RDBs) underpin the majority of global data management systems, where information is structured into multiple interdependent tables. To effectively use the knowledge within RDBs for predictive tasks, recent advances leverage graph representation learning to capture complex inter-table relations as multi-hop dependencies. Despite achieving state-of-the-art performance, these methods remain hindered by the prohibitive storage overhead and excessive training time, due to the massive scale of the database and the computational burden of intensive message passing across interconnected tables. To alleviate these concerns, we propose and study the problem of Relational Database Distillation (RDD). Specifically, we aim to distill large-scale RDBs into compact heterogeneous graphs while retaining the predictive power (i.e., utility) required for training graph-based models. Multi-modal column information is preserved through node features, and primary-foreign key relations are encoded via heterogeneous edges, thereby maintaining both data fidelity and relational structure. To ensure adaptability across diverse downstream tasks without engaging the traditional, inefficient bi-level distillation framework, we further design a kernel ridge regression-guided objective with pseudo-labels, which produces quality features for the distilled graph. Extensive experiments on multiple real-world RDBs demonstrate that our solution substantially reduces the data size while maintaining competitive performance on classification and regression tasks, creating an effective pathway for scalable learning with RDBs.

AI Insights

RDD compresses multi‑table RDBs into heterogeneous graphs, keeping multi‑modal column data as node features.
Primary‑foreign key relations become typed edges, preserving full relational semantics without explicit joins.
A kernel ridge regression objective with pseudo‑labels replaces bi‑level distillation, producing task‑agnostic features in one pass.
Real‑world tests cut storage 10× and training time 5× while keeping accuracy on classification and regression tasks.
The method struggles with highly cyclic schemas or unclear foreign‑key structures, limiting scalability in some edge cases.
For deeper context, see “Heterogeneous Graph Condensation” and “Graph Condensation: A Survey.”

👍 👎 ♥ Save

On the Expressiveness of Languages for Querying Property Graphs in Relational Databases

Hebrew University, Israel

Abstract
SQL/PGQ is the emerging ISO standard for querying property graphs defined as views over relational data. We formalize its expressive power across three fragments: the read-only core, the read-write extension, and an extended variant with richer view definitions. Our results show that graph creation plays a central role in determining the expressiveness. The read-only fragment is strictly weaker than the read-write fragment, and the latter is still below the complexity class NL. Extending view definitions with arbitrary arity identifiers closes this gap: the extended fragment captures exactly NL. This yields a strict hierarchy of SQL/PGQ fragments, whose union covers all NL queries. On ordered structures the hierarchy collapses: once arity-2 identifiers are allowed, higher arities add no power, mirroring the classical transitive-closure collapse and underscoring the central role of view construction in property graph querying.

AI Insights

PGQL is shown to be equivalent to First‑Order Logic with Transitive Closure, allowing any reachability query to be expressed in either language.
The paper supplies bidirectional translations, enabling automated rewrites of FOLTC formulas into SQL/PGQ views and back.
These translations open optimization possibilities: FOLTC’s algebraic simplifications can be applied before query execution.
A survey of PGQL dialects and FOLTC techniques is provided, filling a gap in the literature and guiding future research.
Recommended texts: “First‑Order Logic with Transitive Closure” and “Property Graph Query Languages: A Survey” for deeper study.

Data Warehousing

👍 👎 ♥ Save

Extending ResourceLink: Patterns for Large Dataset Processing in MCP Applications

Ramapo College of New Jrs

Rate this image: 😍 👍 👎

Abstract
Large language models translate natural language into database queries, yet context window limitations prevent direct deployment in reporting systems where complete datasets exhaust available tokens. The Model Context Protocol specification defines ResourceLink for referencing external resources, but practical patterns for implementing scalable reporting architectures remain undocumented. This paper presents patterns for building LLM-powered reporting systems that decouple query generation from data retrieval. We introduce a dual-response pattern extending ResourceLink to support both iterative query refinement and out-of-band data access, accompanied by patterns for multi-tenant security and resource lifecycle management. These patterns address fundamental challenges in LLM-driven reporting applications and provide practical guidance for developers building them.

AI Insights

Dual‑response pattern lets LLMs iteratively refine queries while data is fetched outside the token budget.
It also covers multi‑tenant isolation, resource lifecycle, and progressive discovery.
The authors call for MCP enhancement proposals or RFCs to standardize discovery and REST contracts.
Recommended readings include “Visualization requirements for business intelligence analytics” and “Business intelligence and analytics: From big data to big impact.”
Cited works feature a rank‑by‑feature framework for multidimensional exploration and the r3 consensus‑based text‑to‑SQL system.
No concrete implementation is provided, inviting practitioners to experiment.
The paper assumes familiarity with MCP ResourceLink and LLMs, which may challenge newcomers.

👍 👎 ♥ Save

Aegis: A Correlation-Based Data Masking Advisor for Data Sharing Ecosystems

Abstract
Data-sharing ecosystems enable entities -- such as providers, consumers, and intermediaries -- to access, exchange, and utilize data for various downstream tasks and applications. Due to privacy concerns, data providers typically anonymize datasets before sharing them; however, the existence of multiple masking configurations results in masked datasets with varying utility. Consequently, a key challenge lies in efficiently determining the optimal masking configuration that maximizes a dataset's utility. This paper presents AEGIS, a middleware framework for identifying the optimal masking configuration for machine learning datasets that consist of features and a class label. We introduce a utility optimizer that minimizes predictive utility deviation -- a metric based on the changes in feature-label correlations before and after masking. Our framework leverages limited data summaries (such as 1D histograms) or none to estimate the feature-label joint distribution, making it suitable for scenarios where raw data is inaccessible due to privacy restrictions. To achieve this, we propose a joint distribution estimator based on iterative proportional fitting, which allows supporting various feature-label correlation quantification methods such as g3, mutual information, or chi-square. Our experimental evaluation on real-world datasets shows that AEGIS identifies optimal masking configurations over an order of magnitude faster, while the resulting masked datasets achieve predictive performance on downstream ML tasks that is on par with baseline approaches.

NoSQL Databases

👍 👎 ♥ Save

Comparative Performance Analysis of Modern NoSQL Data Technologies: Redis, Aerospike, and Dragonfly

Rate this image: 😍 👍 👎

Abstract
The rise of distributed applications and cloud computing has created a demand for scalable, high-performance key-value storage systems. This paper presents a performance evaluation of three prominent NoSQL key-value stores: Redis, Aerospike, and Dragonfly, using the Yahoo! Cloud Serving Benchmark (YCSB) framework. We conducted extensive experiments across three distinct workload patterns (read-heavy, write-heavy), and balanced while systematically varying client concurrency from 1 to 32 clients. Our evaluation methodology captures both latency, throughput, and memory characteristics under realistic operational conditions, providing insights into the performance trade-offs and scalability behaviour of each system

SQL

👍 👎 ♥ Save

LitE-SQL: A Lightweight and Efficient Text-to-SQL Framework with Vector-based Schema Linking and Execution-Guided Self-Correction

Abstract
The Text-to-SQL task translates natural language questions into SQL queries, enabling intuitive database interaction for non-experts. While recent methods leveraging Large Language Models (LLMs) achieve strong performance, their reliance on proprietary models raise concerns about deployment feasibility and data privacy. In this work, we introduce LitE-SQL, a Lightweight and Efficient framework with two components: (i) a Schema Retriever that performs efficient schema linking using a vector database of pre-computed schema embeddings, and (ii) a SQL Generator fine-tuned in two stages-supervised fine-tuning followed by execution-guided reinforcement-enabling self-correction without costly multi-candidate generation. On BIRD, LitE-SQL achieves 72.10% execution accuracy, and on Spider 1.0 it reaches 88.45%, demonstrating comparable or superior performance to LLM-based methods despite using 2x to 30x fewer parameters. Our findings demonstrate that high-quality Text-to-SQL generation is feasible with lightweight models, offering a practical solution for privacy-sensitive and resource-constrained settings.

👍 👎 ♥ Save

Agent Bain vs. Agent McKinsey: A New Text-to-SQL Benchmark for the Business Domain

Cornell University and G

Abstract
In the business domain, where data-driven decision making is crucial, text-to-SQL is fundamental for easy natural language access to structured data. While recent LLMs have achieved strong performance in code generation, existing text-to-SQL benchmarks remain focused on factual retrieval of past records. We introduce CORGI, a new benchmark specifically designed for real-world business contexts. CORGI is composed of synthetic databases inspired by enterprises such as Doordash, Airbnb, and Lululemon. It provides questions across four increasingly complex categories of business queries: descriptive, explanatory, predictive, and recommendational. This challenge calls for causal reasoning, temporal forecasting, and strategic recommendation, reflecting multi-level and multi-step agentic intelligence. We find that LLM performance drops on high-level questions, struggling to make accurate predictions and offer actionable plans. Based on execution success rate, the CORGI benchmark is about 21% more difficult than the BIRD benchmark. This highlights the gap between popular LLMs and the need for real-world business intelligence. We release a public dataset and evaluation framework, and a website for public submissions.

Interests not found

Help us improve your experience!