Hi!

Your personalized paper recommendations for 1 to 5 December 2025.
Database Design
Sonic Labs
Abstract
The State Database of a blockchain stores account data and enables authentication. Modern blockchains use fast consensus protocols to avoid forking, improving throughput and finality. However, Ethereum's StateDB was designed for a forking chain that maintains multiple state versions. While newer blockchains adopt Ethereum's standard for DApp compatibility, they do not require multiple state versions, making legacy Ethereum databases inefficient for fast, non-forking blockchains. Moreover, existing StateDB implementations are built on general-purpose key-value stores (e.g., LevelDB), which limits their efficiency. This paper introduces a novel state database that is a native database implementation and maintains Ethereum compatibility while being specialized for non-forking blockchains. Our database delivers a tenfold speedup and a 99% reduction in storage for validators, and a threefold decrease in storage requirements for archive nodes.
AI Summary
  • The trade-off yields significantly improved performance under normal conditions. [3]
  • The proposed design adopts a performance-oriented strategy that deliberately relaxes traditional consistency guarantees. [2]
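A hedged illustration of the abstract's central contrast: a forking chain's StateDB must retain many state versions, while a non-forking chain can keep exactly one. The sketch below is ours, not the paper's design; every name in it is invented, and a real implementation would persist to disk and maintain authentication structures.

    /* Minimal sketch (not the paper's design): a flat, single-version
     * account store for a non-forking chain. A forking chain would need
     * one record per (address, version) plus pruning; here the latest
     * state is the only state, so updates simply overwrite in place. */
    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    #define ADDR_LEN    20
    #define TABLE_SLOTS 1024   /* toy capacity */

    typedef struct {
        uint8_t  addr[ADDR_LEN];
        uint64_t balance;
        uint64_t nonce;
        int      used;
    } Account;

    static Account table[TABLE_SLOTS];

    static size_t slot_for(const uint8_t *addr) {
        uint64_t h = 1469598103934665603ULL;              /* FNV-1a */
        for (int i = 0; i < ADDR_LEN; i++)
            h = (h ^ addr[i]) * 1099511628211ULL;
        return (size_t)(h % TABLE_SLOTS);
    }

    /* Update in place: no version chains, no multi-root trie. */
    void put(const uint8_t *addr, uint64_t balance, uint64_t nonce) {
        size_t i = slot_for(addr);
        while (table[i].used && memcmp(table[i].addr, addr, ADDR_LEN) != 0)
            i = (i + 1) % TABLE_SLOTS;                    /* linear probe */
        memcpy(table[i].addr, addr, ADDR_LEN);
        table[i].balance = balance;
        table[i].nonce   = nonce;
        table[i].used    = 1;
    }

    int main(void) {
        uint8_t a[ADDR_LEN] = { 0xab };
        put(a, 100, 1);
        put(a, 90, 2);   /* overwrites; version 1 is gone, by design */
        printf("balance=%llu nonce=%llu\n",
               (unsigned long long)table[slot_for(a)].balance,
               (unsigned long long)table[slot_for(a)].nonce);
        return 0;
    }

The single overwrite in main is the whole point: once forks are impossible, retaining superseded versions (as Ethereum's trie-over-LevelDB stack does) buys nothing and costs space.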
S&P Global
Abstract
Accurate question answering over real spreadsheets remains difficult: multi-row headers, merged cells, and unit annotations disrupt naive chunking, while rigid SQL views fail on files lacking consistent schemas. We present SQuARE, a hybrid retrieval framework with sheet-level, complexity-aware routing. It computes a continuous score based on header depth and merge density, then routes queries either through structure-preserving chunk retrieval or through SQL over an automatically constructed relational representation. A lightweight agent supervises retrieval, refinement, or combination of results across both paths when confidence is low. This design maintains header hierarchies, time labels, and units, ensuring that returned values are faithful to the original cells and straightforward to verify. Evaluated on multi-header corporate balance sheets, a heavily merged World Bank workbook, and diverse public datasets, SQuARE consistently surpasses single-strategy baselines and ChatGPT-4o on both retrieval precision and end-to-end answer accuracy while keeping latency predictable. By decoupling retrieval from model choice, the system is compatible with emerging tabular foundation models and offers a practical bridge toward more robust table understanding.
AI Summary
  • The paper presents a retrieval-augmented generation (RAG) framework for tabular question answering. [2]
  • The proposed RAG framework uses a combination of embedding models and SQL reasoning to improve the accuracy of tabular QA systems. [1]
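The routing decision in SQuARE invites a small sketch. Everything numeric below is invented (the paper defines its own continuous score over header depth and merge density); only the shape of the logic, score then branch then agent fallback, comes from the abstract.

    /* Hypothetical complexity-aware router in the spirit of SQuARE.
     * Weights and thresholds are made up for illustration. */
    #include <stdio.h>

    typedef enum { ROUTE_CHUNKS, ROUTE_SQL } Route;

    typedef struct {
        int    header_depth;   /* levels of (possibly multi-row) headers */
        double merge_density;  /* merged cells / total cells, in [0,1]   */
    } SheetStats;

    /* Continuous score: deeper headers and denser merges push the
     * sheet toward structure-preserving chunk retrieval. */
    static double complexity(SheetStats s) {
        return 0.5 * (s.header_depth / 4.0) + 0.5 * s.merge_density;
    }

    static Route route(SheetStats s) {
        return complexity(s) > 0.35 ? ROUTE_CHUNKS : ROUTE_SQL;
    }

    int main(void) {
        SheetStats balance_sheet = { .header_depth = 3, .merge_density = 0.20 };
        SheetStats flat_export   = { .header_depth = 1, .merge_density = 0.00 };
        printf("balance sheet -> %s\n",
               route(balance_sheet) == ROUTE_SQL ? "SQL" : "chunks");
        printf("flat export   -> %s\n",
               route(flat_export) == ROUTE_SQL ? "SQL" : "chunks");
        /* When answer confidence is low, the supervising agent would
         * run the other path too and reconcile results (not shown). */
        return 0;
    }

The multi-header balance sheet scores 0.475 and routes to chunk retrieval; the flat export scores 0.125 and goes to SQL.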
Data Warehousing
Waseda University
Abstract
Organizations struggle to share data across departments that have adopted different data analytics platforms. If n datasets must serve m environments, up to n × m replicas can emerge, increasing inconsistency and cost. Traditional warehouses copy data into vendor-specific stores; cross-platform access is hard. This study proposes the Enterprise Data Science Platform (EDSP), which builds on data lakehouse architecture and follows a Write-Once, Read-Anywhere principle. EDSP enables federated data access for multi-query engine environments, targeting data science workloads with periodic data updates and query response times ranging from seconds to minutes. By providing centralized data management with federated access from multiple query engines to the same data sources, EDSP eliminates data duplication and vendor lock-in inherent in traditional data warehouses. The platform employs a four-layer architecture: Data Preparation, Data Store, Access Interface, and Query Engines. This design enforces separation of concerns and reduces the need for data migration when integrating additional analytical environments. Experimental results demonstrate that major cloud data warehouses and programming environments can directly query EDSP-managed datasets. We implemented and deployed EDSP in production, confirming interoperability across multiple query engines. For data sharing across different analytical environments, EDSP achieves a 33-44% reduction in operational steps compared with conventional approaches requiring data migration. Although query latency may increase by up to a factor of 2.6 compared with native tables, end-to-end completion times remain on the order of seconds, maintaining practical performance for analytical use cases. Based on our production experience, EDSP provides practical design guidelines for addressing the data-silo problem in multi-query engine environments.
AI Summary
  • { "title": "Enterprise Data Science Platform (EDSP)", "description": "A unified data management architecture that addresses data management challenges in multi-query engine environments." } { "term": "Write-Once, Read-Anywhere", "definition": "A principle that enables data to be written once and read from multiple query engines without replication or duplication." } { "title": "EDSP Demonstrates Practical Solution to Data Silos in Multi-Query Engine Enterprises" , "description": "The Enterprise Data Science Platform (EDSP) demonstrates that the Write-Once, Read-Anywhere principle can be realized in production environments, offering a practical solution to the long-standing problem of data silos in multi-query engine enterprises." } { "title": "Limited Performance Validation" , "description": "Future work includes performance validation on TB-scale datasets." } { "title": "Data Lake Architectures and Metadata Management" , "description": "The paper references a study on data lake architectures and metadata management, highlighting the importance of metadata in data sharing across heterogeneous query engines." } The paper proposes the Enterprise Data Science Platform (EDSP), a unified data management architecture grounded in the Write-Once, Read-Anywhere principle, to address data management challenges in multi-query engine environments. [2]
CAS
Abstract
Real-world enterprise data intelligence workflows encompass data engineering, which turns raw sources into analysis-ready tables, and data analysis, which converts those tables into decision-oriented insights. We introduce DAComp, a benchmark of 210 tasks that mirrors these complex workflows. Data engineering (DE) tasks require repository-level engineering on industrial schemas, including designing and building multi-stage SQL pipelines from scratch and evolving existing systems under changing requirements. Data analysis (DA) tasks pose open-ended business problems that demand strategic planning, exploratory analysis through iterative coding, interpretation of intermediate results, and the synthesis of actionable recommendations. Engineering tasks are scored through execution-based, multi-metric evaluation. Open-ended tasks are assessed by a reliable, experimentally validated LLM judge guided by hierarchical, meticulously crafted rubrics. Our experiments reveal that even state-of-the-art agents falter on DAComp. Performance on DE tasks is particularly low, with success rates under 20%, exposing a critical bottleneck in holistic pipeline orchestration, not merely code generation. Scores on DA tasks also average below 40%, highlighting profound deficiencies in open-ended reasoning and demonstrating that engineering and analysis are distinct capabilities. By clearly diagnosing these limitations, DAComp provides a rigorous and realistic testbed to drive the development of truly capable autonomous data agents for enterprise settings. Our data and code are available at https://da-comp.github.io
AI Summary
  • DAComp is a comprehensive benchmark designed to evaluate data agents across the full data intelligence lifecycle. [3]
  • The benchmark aims to steer the community beyond mere technical accuracy, driving the evolution of truly autonomous and capable data agents for the enterprise. [3]
  • Data Agent (DA): An LLM-driven autonomous system that plans and executes end-to-end workflows, acquiring, transforming, and analyzing data via tool use and code execution to achieve user-defined objectives. [3]
  • DAComp (Data Agent Comprehensive Benchmark) is a rigorous standard for evaluating LLM-based data agents, bridging the gap between isolated code generation and real-world enterprise demands. [3]
  • The benchmark includes two testbeds: DAComp-DE for repository-level pipeline orchestration and DAComp-DA for open-ended analytical reasoning. [2]
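The abstract's two-track evaluation, execution-based multi-metric scoring for DE and rubric-guided LLM judging for DA, can be sketched as weighted aggregation. The metric names, weights, and numbers below are hypothetical; the benchmark's real scoring functions live in the paper, not the abstract.

    /* Hypothetical scoring sketch; DAComp's actual metrics differ. */
    #include <stdio.h>

    typedef struct { const char *name; double weight; double score; } Metric;

    /* Weighted aggregate over execution-based checks (DE tasks).
     * A hierarchical DA rubric would roll leaf scores up the same way. */
    static double aggregate(const Metric *m, int n) {
        double total = 0.0, wsum = 0.0;
        for (int i = 0; i < n; i++) {
            total += m[i].weight * m[i].score;
            wsum  += m[i].weight;
        }
        return total / wsum;
    }

    int main(void) {
        Metric de_task[] = {
            { "pipeline_executes",   0.4, 1.0 },  /* ran end to end        */
            { "output_tables_match", 0.4, 0.5 },  /* half the tables right */
            { "schema_conventions",  0.2, 1.0 },
        };
        printf("DE task score: %.2f\n", aggregate(de_task, 3));  /* 0.80 */
        return 0;
    }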
SQL
University of Waterloo
Abstract
We distinguish two frameworks for decisions under ambiguity: evaluate-then-aggregate (ETA) and aggregate-then-evaluate (ATE). Given a statistic that represents the decision maker's pure-risk preferences (such as expected utility) and an ambiguous act, an ETA model first evaluates the act under each plausible probabilistic model using this statistic and then aggregates the resulting evaluations according to ambiguity attitudes. In contrast, an ATE model first aggregates ambiguity by assigning the act a single representative distribution and then evaluates that distribution using the statistic. These frameworks differ in the order in which risk and ambiguity are processed, and they coincide when there is no ambiguity. While most existing ambiguity models fall within the ETA framework, our study focuses on the ATE framework, which is conceptually just as compelling and has been relatively neglected in the literature. We develop a Choquet ATE model, which generalizes the Choquet expected utility model by allowing arbitrary pure-risk preferences. We provide an axiomatization of this model in a Savage setting with an exogenous source of unambiguous events. The Choquet ATE framework allows us to analyze a wide range of ambiguity attitudes and their interplay with risk attitudes.
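In schematic notation of our own choosing (the paper's formalism is richer): write \rho for the statistic encoding pure-risk preferences, \Theta for the set of plausible probabilistic models, and P_f^{\theta} for the distribution the act f induces under model \theta. The two frameworks then differ only in the order of operations:

    V_{\mathrm{ETA}}(f) = \Phi\!\left( \bigl( \rho(P_f^{\theta}) \bigr)_{\theta \in \Theta} \right)
    \quad \text{(evaluate under each model, then aggregate via } \Phi \text{)}

    V_{\mathrm{ATE}}(f) = \rho\!\left( \bar{P}_f \right), \qquad
    \bar{P}_f = A\!\left( \bigl( P_f^{\theta} \bigr)_{\theta \in \Theta} \right)
    \quad \text{(aggregate to one distribution via } A \text{, then evaluate)}

When \Theta is a singleton, both collapse to \rho(P_f), matching the abstract's remark that the frameworks coincide without ambiguity; in the paper's Choquet ATE model the aggregation is of Choquet type, and taking \rho to be expected utility recovers Choquet expected utility.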
NoSQL Databases
Technische Universität D
Abstract
We study how modern database systems can leverage the Linux io_uring interface for efficient, low-overhead I/O. io_uring is an asynchronous system call batching interface that unifies storage and network operations, addressing limitations of existing Linux I/O interfaces. However, naively replacing traditional I/O interfaces with io_uring does not necessarily yield performance benefits. To demonstrate when io_uring delivers the greatest benefits and how to use it effectively in modern database systems, we evaluate it in two use cases: integrating io_uring into a storage-bound buffer manager and using it for high-throughput data shuffling in network-bound analytical workloads. We further analyze how advanced io_uring features, such as registered buffers and passthrough I/O, affect end-to-end performance. Our study shows when low-level optimizations translate into tangible system-wide gains and how architectural choices influence these benefits. Building on these insights, we derive practical guidelines for designing I/O-intensive systems using io_uring and validate their effectiveness in a case study of PostgreSQL's recent io_uring integration, where applying our guidelines yields a performance improvement of 14%.
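For readers new to the interface the abstract evaluates: io_uring queues requests in a submission ring shared with the kernel, so one io_uring_submit() call can cover an arbitrary batch of operations. A minimal liburing read batch, unrelated to the paper's buffer-manager integration and with error handling elided:

    /* Minimal liburing sketch: a batch of reads, one submit syscall.
     * Error handling elided; not the paper's code. Link with -luring. */
    #include <liburing.h>
    #include <fcntl.h>

    #define BATCH 8
    #define BLK   4096

    int main(void) {
        struct io_uring ring;
        io_uring_queue_init(BATCH, &ring, 0);

        int fd = open("datafile", O_RDONLY);
        static char bufs[BATCH][BLK];

        /* Queue BATCH reads; nothing is submitted to the kernel yet. */
        for (int i = 0; i < BATCH; i++) {
            struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);
            io_uring_prep_read(sqe, fd, bufs[i], BLK, (off_t)i * BLK);
        }
        io_uring_submit(&ring);   /* one syscall covers all BATCH reads */

        /* Reap completions; they may arrive in any order. */
        for (int i = 0; i < BATCH; i++) {
            struct io_uring_cqe *cqe;
            io_uring_wait_cqe(&ring, &cqe);
            io_uring_cqe_seen(&ring, cqe);
        }
        io_uring_queue_exit(&ring);
        return 0;
    }

The advanced features the study measures layer onto this same loop: io_uring_register_buffers() pins buffers once so the kernel skips per-I/O mapping, and passthrough I/O sends NVMe commands directly, bypassing parts of the block layer.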

Interests not found

We did not find any papers matching the interests below. Try other terms, and consider whether the content exists on arxiv.org.
  • Relational Databases
You can edit or add more interests any time.