Papers from 22 to 26 September, 2025

Here are your personalized paper recommendations, sorted by relevance.
Relational Databases
Abstract
We introduce QueryGym, an interactive environment for building, testing, and evaluating LLM-based query planning agents. Existing frameworks often tie agents to specific query language dialects or obscure their reasoning; QueryGym instead requires agents to construct explicit sequences of relational algebra operations, ensuring engine-agnostic evaluation and transparent step-by-step planning. The environment is implemented as a Gymnasium interface that supplies observations -- including schema details, intermediate results, and execution feedback -- and receives actions that represent database exploration (e.g., previewing tables, sampling column values, retrieving unique values) as well as relational algebra operations (e.g., filter, project, join). We detail the motivation and the design of the environment. In the demo, we showcase the utility of the environment by contrasting it with contemporary LLMs that query databases. QueryGym serves as a practical testbed for research in error remediation, transparency, and reinforcement learning for query generation. For the associated demo, see https://ibm.biz/QueryGym.
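The loop the abstract describes, with observations of schema and intermediate results coming in and relational-algebra actions going out, can be sketched as a toy Gymnasium-style environment. Every class, method, and action name below is an illustrative assumption, not QueryGym's actual API:

```python
# Toy sketch of a Gymnasium-style environment in the spirit of QueryGym.
# The agent refines an intermediate result with explicit relational-algebra
# steps (filter, project) and receives execution feedback as observations.
# All names here are hypothetical, not the real QueryGym interface.

class MiniQueryEnv:
    def __init__(self, rows):
        self.rows = rows            # list of dicts, the base "table"
        self.result = list(rows)    # intermediate result the agent refines

    def reset(self):
        # Observation: schema details plus a small preview, as in the abstract.
        self.result = list(self.rows)
        return {"schema": sorted(self.rows[0].keys()),
                "preview": self.result[:2]}

    def step(self, action):
        op, arg = action
        if op == "filter":          # arg: predicate over a row
            self.result = [r for r in self.result if arg(r)]
        elif op == "project":       # arg: columns to keep
            self.result = [{c: r[c] for c in arg} for r in self.result]
        obs = {"rows": len(self.result), "preview": self.result[:2]}
        return obs, 0.0, False, {}  # obs, reward, done, info

rows = [{"id": 1, "city": "Oslo"}, {"id": 2, "city": "Lima"}]
env = MiniQueryEnv(rows)
env.reset()
obs, _, _, _ = env.step(("filter", lambda r: r["city"] == "Oslo"))
obs, _, _, _ = env.step(("project", ["id"]))
```

Because the actions are explicit algebra operations rather than dialect-specific SQL strings, the same plan could in principle be replayed against any engine, which is the engine-agnostic property the paper emphasizes.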
Abstract
Graph analytics is widely used in many fields to analyze complex patterns. In most cases, however, important company data is stored in RDBMSs, so graphs must be extracted from relational databases before graph analysis can be performed. Most existing methods fail to extract the user-intended graph, since doing so typically requires complex join query processing. We propose an efficient graph extraction method, ExtGraph, which extracts user-intended graphs efficiently through hybrid query processing that combines outer joins with materialized views. Through experiments on the TPC-DS, DBLP, and IMDB datasets, we show that ExtGraph outperforms state-of-the-art methods by up to 2.78x in graph extraction time.
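A minimal sketch of the idea behind outer-join-based extraction, using Python's built-in sqlite3 rather than ExtGraph itself (the schema and the materialized-view-as-SQL-view simplification are assumptions for illustration):

```python
# Illustrative sketch, not ExtGraph's implementation: extracting an edge
# list from relational tables with a LEFT OUTER JOIN, so that vertices
# without edges (here, an author with no papers) still appear in the graph.
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE author(aid INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE writes(aid INTEGER, pid INTEGER);
INSERT INTO author VALUES (1, 'Ada'), (2, 'Bob');
INSERT INTO writes VALUES (1, 10);
""")

# Stand-in for a materialized view: compute the join once, reuse it for
# every subsequent extraction query.
cur.execute("""
CREATE VIEW author_paper AS
SELECT a.aid AS aid, w.pid AS pid
FROM author a LEFT OUTER JOIN writes w ON a.aid = w.aid
""")

edges = cur.execute("SELECT aid, pid FROM author_paper ORDER BY aid").fetchall()
# Bob (aid 2) has no paper, yet survives the outer join as (2, None),
# so the extracted graph keeps him as an isolated vertex.
```

An inner join would silently drop such rows, which is one way extraction can fail to match the user's intended graph.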
Data Warehousing
Abstract
In the digital era, data spaces are emerging as key ecosystems for the secure and controlled exchange of information among participants. To achieve this, components such as metadata catalogs and data space connectors are essential. This document proposes an implementation and integration solution for both elements, following standardization guidelines for data formats, metadata, and protocols to ensure interoperability. A hybrid solution is presented: DataHub serves as a federated catalog for robust metadata management, leveraging its advanced ingestion, governance, and lineage capabilities, while a custom implementation, Rainbow Catalog, manages ODRL policies for access and usage. This integration makes it possible to query datasets from DataHub and associate them with ODRL policies, facilitating the negotiation and transfer flows defined by the Dataspace Protocol. The result is a system that combines DataHub's large-scale cataloging with the connector's policy management, which is crucial for sovereignty and trust in data spaces.
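To make the catalog-plus-policy pairing concrete, here is a small sketch of checking a dataset request against an ODRL-style permission. The policy structure loosely follows the ODRL Information Model, but the dataset URN, field values, and checker function are hypothetical, not DataHub's or Rainbow Catalog's API:

```python
# Hypothetical sketch: an ODRL-style policy attached to a cataloged
# dataset, and a checker that gates access on a purpose constraint.
# Identifiers and field values are illustrative assumptions.

policy = {
    "@type": "Offer",
    "permission": [{
        "target": "urn:dataset:energy-readings",
        "action": "use",
        "constraint": [{"leftOperand": "purpose",
                        "operator": "eq",
                        "rightOperand": "research"}],
    }],
}

def is_permitted(policy, target, action, purpose):
    """Return True if some permission covers this target/action and
    every purpose constraint on it matches the requested purpose."""
    for perm in policy["permission"]:
        if perm["target"] != target or perm["action"] != action:
            continue
        if all(c["rightOperand"] == purpose
               for c in perm["constraint"]
               if c["leftOperand"] == "purpose"):
            return True
    return False
```

In the integrated system, a check like this would run during the Dataspace Protocol negotiation step, before any transfer flow begins.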
AI Insights
  • DataHub’s lineage tracking pairs with Rainbow’s ODRL enforcement, enabling policy‑aware queries across federated catalogs.
  • Dataspace Protocol negotiation automates access rights, reducing manual setup.
  • Scalability derives from DataHub’s distributed ingestion and Rainbow’s horizontally scalable engine.
  • ODRL constraints embedded in query plans harden security against unauthorized exposure.
  • Implementing the hybrid stack demands deep expertise in catalog governance and policy semantics, a noted challenge.
  • FAIRness studies link semantic enrichment to discoverability, guiding catalog design.
  • Cited works span open‑government data, linked‑data archaeology, and GNAP4VP authentication, underscoring interdisciplinary roots.
Abstract
In this paper, we investigate three cross-facility data streaming architectures, Direct Streaming (DTS), Proxied Streaming (PRS), and Managed Service Streaming (MSS). We examine their architectural variations in data flow paths and deployment feasibility, and detail their implementation using the Data Streaming to HPC (DS2HPC) architectural framework and the SciStream memory-to-memory streaming toolkit on the production-grade Advanced Computing Ecosystem (ACE) infrastructure at Oak Ridge Leadership Computing Facility (OLCF). We present a workflow-specific evaluation of these architectures using three synthetic workloads derived from the streaming characteristics of scientific workflows. Through simulated experiments, we measure streaming throughput, round-trip time, and overhead under work sharing, work sharing with feedback, and broadcast and gather messaging patterns commonly found in AI-HPC communication motifs. Our study shows that DTS offers a minimal-hop path, resulting in higher throughput and lower latency, whereas MSS provides greater deployment feasibility and scalability across multiple users but incurs significant overhead. PRS lies in between, offering a scalable architecture whose performance matches DTS in most cases.
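One of the messaging patterns the evaluation exercises, work sharing with feedback, can be sketched with in-process queues standing in for the memory-to-memory streams that DS2HPC and SciStream provide. This is a toy illustration of the pattern only, not the paper's implementation:

```python
# Toy sketch of the "work sharing with feedback" pattern: a producer
# streams work items to a consumer, and the consumer sends an
# acknowledgement back on a feedback channel. Queues stand in for the
# cross-facility streams; this is not the DS2HPC/SciStream API.
import queue
import threading

work, feedback = queue.Queue(), queue.Queue()

def consumer():
    while True:
        item = work.get()
        if item is None:          # sentinel: no more work
            break
        feedback.put(("done", item))   # feedback leg of the pattern

t = threading.Thread(target=consumer)
t.start()
for i in range(3):                # producer shares out work items
    work.put(i)
work.put(None)
t.join()
acks = [feedback.get() for _ in range(3)]
```

In the cross-facility setting, the feedback leg is what distinguishes this motif from plain work sharing, and it is also where the extra hops of the proxied and managed architectures add round-trip overhead.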
NoSQL Databases
Abstract
Continuous and reliable access to curated biological data repositories is indispensable for accelerating rigorous scientific inquiry and fostering reproducible research. Centralized repositories, though widely used, are vulnerable to single points of failure arising from cyberattacks, technical faults, natural disasters, or funding and political uncertainties. This can lead to widespread data unavailability, data loss, integrity compromises, and substantial delays in critical research, ultimately impeding scientific progress. Centralizing essential scientific resources in a single geopolitical or institutional hub is inherently dangerous, as any disruption can paralyze diverse ongoing research. The rapid acceleration of data generation, combined with an increasingly volatile global landscape, necessitates a critical re-evaluation of the sustainability of centralized models. Implementing federated and decentralized architectures presents a compelling and future-oriented pathway to substantially strengthen the resilience of scientific data infrastructures, thereby mitigating vulnerabilities and ensuring the long-term integrity of data. Here, we examine the structural limitations of centralized repositories, evaluate federated and decentralized models, and propose a hybrid framework for resilient, FAIR, and sustainable scientific data stewardship. Such an approach offers a significant reduction in exposure to governance instability, infrastructural fragility, and funding volatility, and also fosters fairness and global accessibility. The future of open science depends on integrating these complementary approaches to establish a globally distributed, economically sustainable, and institutionally robust infrastructure that safeguards scientific data as a public good, further ensuring continued accessibility, interoperability, and preservation for generations to come.
AI Insights
  • EOSC’s federated nodes already host 1 million genomes, a living model of distributed stewardship.
  • ELIXIR’s COVID‑19 response proved community pipelines can scale to pandemic‑grade data volumes.
  • The Global Biodata Coalition’s roadmap envisions a cross‑border mesh that outpaces single‑point failure risks.
  • DeSci employs blockchain provenance to give researchers immutable audit trails for every dataset.
  • NIH’s Final Data Policy now mandates FAIR compliance, nudging institutions toward hybrid decentralized architectures.
  • DeSci still struggles with interoperability, as heterogeneous metadata schemas block seamless cross‑platform queries.
  • Privacy‑by‑design in distributed repositories remains a top research gap, inviting novel cryptographic solutions.
SQL
Abstract
Developing custom reasoning models via Reinforcement Learning (RL) that can incorporate organization-specific knowledge has great potential to address problems faced by enterprise customers. In many of these problems, the reward function is verifiable, a setting termed RL with Verifiable Rewards (RLVR). We apply RLVR to a popular data science benchmark called BIRD that measures the ability of an AI agent to convert a natural language query for a database to SQL executions. We apply a simple and general-purpose training recipe involving careful prompt and model selection, a warm-up stage using our offline RL approach called TAO, followed by rigorous online RLVR training. With no additional training data beyond the BIRD training set and no use of proprietary models, our very first submission to the BIRD leaderboard reached state-of-the-art accuracy on the private test set: 73.56% without self-consistency and 75.68% with self-consistency. In the latter case, our model also required fewer generations than the second-best approach. While BIRD is only a proxy task, the simplicity of our framework makes it broadly applicable to enterprise domains such as business intelligence, data science, and coding.
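The "verifiable reward" at the heart of RLVR for text-to-SQL can be sketched simply: a generated query earns reward only if executing it reproduces the gold query's result. This mirrors the execution-accuracy idea behind BIRD; the schema and queries below are made up for illustration:

```python
# Sketch of a verifiable reward for text-to-SQL in the RLVR sense:
# reward 1.0 iff the generated query's execution result matches the
# gold query's result. Schema and queries are illustrative assumptions.
import sqlite3

def execution_reward(db, generated_sql, gold_sql):
    """Return 1.0 if both queries yield the same multiset of rows."""
    try:
        got = db.execute(generated_sql).fetchall()
    except sqlite3.Error:
        return 0.0                 # invalid SQL simply earns no reward
    gold = db.execute(gold_sql).fetchall()
    return 1.0 if sorted(got) == sorted(gold) else 0.0

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE sales(region TEXT, amount INT);
INSERT INTO sales VALUES ('EU', 5), ('US', 7), ('EU', 3);
""")
r = execution_reward(db,
                     "SELECT SUM(amount) FROM sales WHERE region='EU'",
                     "SELECT SUM(amount) FROM sales WHERE region = 'EU'")
```

Because the check is automatic and unambiguous, it needs no learned reward model, which is what makes the reward "verifiable" and the online RL stage tractable.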

Interests not found

We did not find any papers matching the interests below. Try other terms, and consider whether such content exists on arxiv.org.
  • Database Design
