Papers from 29 September to 03 October, 2025

Here are your personalized paper recommendations, sorted by relevance.
Relational Databases
Harbin Institute of Technology
Abstract
Error detection in relational databases is critical for maintaining data quality and is fundamental to tasks such as data cleaning and assessment. Current error detection studies mostly employ a multi-detector approach to handle heterogeneous attributes in databases, incurring high costs. Additionally, their data preprocessing strategies fail to leverage the variable-length nature of data sequences, reducing accuracy. In this paper, we propose an attribute-wise PAttern-perceptive Transformer (PAT) framework for error detection in relational databases. First, PAT introduces a learned pattern module that captures attribute-specific data distributions through embeddings learned during model training. Second, the Quasi-Tokens Arrangement (QTA) tokenizer divides each cell sequence based on its length and word types and then generates word-adaptive data tokens, while providing compact hyperparameters to ensure efficiency. By interleaving data tokens with attribute-specific pattern tokens, PAT jointly learns data features shared across attributes and pattern features that are distinctive to each attribute. Third, PAT visualizes its attention maps to interpret its error detection mechanism. Extensive experiments show that PAT achieves excellent F1 scores compared with state-of-the-art data error detection methods. Moreover, PAT significantly reduces model parameters and FLOPs when the compact QTA tokenizer is applied.
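
To make the tokenization idea concrete, here is a minimal Python sketch of a QTA-style split and pattern-token interleaving. The function names, the character-class splitting rule, and the placeholder pattern token are all assumptions for illustration; the paper's actual tokenizer uses learned embeddings and tuned hyperparameters.

    # Illustrative sketch only: splits a cell into word-type runs and interleaves
    # an attribute-specific pattern token, in the spirit of the QTA description.
    import re

    def quasi_token_split(cell: str):
        """Split a cell value into runs of the same word type (letters, digits, other)."""
        # e.g. "AB-123" -> ["AB", "-", "123"]
        return re.findall(r"[A-Za-z]+|\d+|[^A-Za-z\d]+", cell)

    def tokenize_with_pattern(cell: str, attribute: str):
        """Interleave word-adaptive data tokens with an attribute-specific pattern token."""
        pattern_token = f"<PAT:{attribute}>"   # stands in for a learned embedding
        tokens = []
        for run in quasi_token_split(cell):
            tokens.append(pattern_token)       # pattern token carries the attribute's distribution
            tokens.append(run)
        return tokens

    print(tokenize_with_pattern("AB-123", "zip_code"))
    # ['<PAT:zip_code>', 'AB', '<PAT:zip_code>', '-', '<PAT:zip_code>', '123']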
CyberAgent
Abstract
Schema design, particularly normalization, is a critical yet often overlooked factor in natural language to SQL (NL2SQL) systems. Most prior research evaluates models on fixed schemas, overlooking the influence of schema design on performance. We present the first systematic study of schema normalization's impact, evaluating eight leading large language models on synthetic and real-world datasets with varied normalization levels. We construct controlled synthetic datasets with formal normalization (1NF-3NF) and real academic-paper datasets with practical schemas. Our results show that denormalized schemas offer high accuracy on simple retrieval queries, even with cost-effective models in zero-shot settings. In contrast, normalized schemas (2NF/3NF) introduce challenges such as errors in base table selection and join type prediction; however, these issues are substantially mitigated by providing few-shot examples. For aggregation queries, normalized schemas yielded better performance, mainly due to their robustness against the data duplication and NULL value issues that cause errors in denormalized schemas. These findings suggest that the optimal schema design for NL2SQL applications depends on the types of queries to be supported. Our study demonstrates the importance of considering schema design when developing NL2SQL interfaces and points to adaptive schema selection for real-world scenarios.
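
To illustrate the aggregation finding, here is a small self-contained Python sketch (with invented tables and data, not the paper's benchmarks) showing how row duplication in a denormalized schema skews a naive count, while the 3NF design answers correctly:

    # Contrast a denormalized table with a 3NF design over the same facts.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    -- denormalized: one row per (paper, author), venue repeated per author
    CREATE TABLE papers_denorm (paper_id INT, title TEXT, venue TEXT, author TEXT);
    INSERT INTO papers_denorm VALUES
      (1, 'A', 'VLDB', 'alice'), (1, 'A', 'VLDB', 'bob'), (2, 'B', 'SIGMOD', 'carol');

    -- 3NF: paper facts stored once, authorship in a separate relation
    CREATE TABLE papers (paper_id INT PRIMARY KEY, title TEXT, venue TEXT);
    CREATE TABLE authorship (paper_id INT, author TEXT);
    INSERT INTO papers VALUES (1, 'A', 'VLDB'), (2, 'B', 'SIGMOD');
    INSERT INTO authorship VALUES (1, 'alice'), (1, 'bob'), (2, 'carol');
    """)

    # Naive count over the denormalized table double-counts paper 1:
    print(con.execute("SELECT COUNT(*) FROM papers_denorm WHERE venue='VLDB'").fetchone())  # (2,)
    # The normalized schema gives the correct paper count directly:
    print(con.execute("SELECT COUNT(*) FROM papers WHERE venue='VLDB'").fetchone())         # (1,)

A model querying the denormalized table would need DISTINCT or a subquery to get the paper count right, which is exactly the kind of duplication-induced error the study attributes to denormalized schemas.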
Data Warehousing
Brookhaven National Lab
Abstract
Large-scale international collaborations such as ATLAS rely on globally distributed workflows and data management to process, move, and store vast volumes of data. ATLAS's Production and Distributed Analysis (PanDA) workflow system and the Rucio data management system are each highly optimized for their respective design goals. However, operating them together at global scale exposes systemic inefficiencies, including underutilized resources, redundant or unnecessary transfers, and altered error distributions. Moreover, PanDA and Rucio currently lack shared performance awareness and coordinated, adaptive strategies. This work charts a path toward co-optimizing the two systems by diagnosing data-management pitfalls and prioritizing end-to-end improvements. With the observation of spatially and temporally imbalanced transfer activities, we develop a metadata-matching algorithm that links PanDA jobs and Rucio datasets at the file level, yielding a complete, fine-grained view of data access and movement. Using this linkage, we identify anomalous transfer patterns that violate PanDA's data-centric job-allocation principle. We then outline mitigation strategies for these patterns and highlight opportunities for tighter PanDA-Rucio coordination to improve resource utilization, reduce unnecessary data movement, and enhance overall system resilience.
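
As a concrete illustration of the file-level linkage described above, here is a toy Python sketch that joins jobs to datasets through shared file identifiers. The record layouts (job_id, input_lfns, dataset file sets) are invented for illustration; they are not the PanDA or Rucio APIs, nor the paper's actual matching algorithm.

    # Link each job to the datasets its input files belong to.
    from collections import defaultdict

    jobs = [
        {"job_id": 101, "site": "BNL",  "input_lfns": {"f1.root", "f2.root"}},
        {"job_id": 102, "site": "CERN", "input_lfns": {"f2.root", "f3.root"}},
    ]
    datasets = [
        {"dataset": "data23.A", "files": {"f1.root", "f2.root"}},
        {"dataset": "data23.B", "files": {"f3.root"}},
    ]

    # Index datasets by file, then map each job to the datasets its inputs touch.
    file_to_dataset = {f: d["dataset"] for d in datasets for f in d["files"]}
    job_to_datasets = defaultdict(set)
    for job in jobs:
        for lfn in job["input_lfns"]:
            job_to_datasets[job["job_id"]].add(file_to_dataset[lfn])

    print(dict(job_to_datasets))
    # {101: {'data23.A'}, 102: {'data23.A', 'data23.B'}} (set order may vary)

With this job-to-dataset view in hand, transfers can be checked against where the consuming jobs actually ran, which is the kind of anomaly the paper surfaces.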
University College Cork
Abstract
This chapter presents a comprehensive taxonomy for assessing data quality in the context of data monetisation, developed through a systematic literature review. Organising over one hundred metrics and Key Performance Indicators (KPIs) into four subclusters (Fundamental, Contextual, Resolution, and Specialised) within the Balanced Scorecard (BSC) framework, the taxonomy integrates both universal and domain-specific quality dimensions. By positioning data quality as a strategic connector across the BSC's Financial, Customer, Internal Processes, and Learning & Growth perspectives, it demonstrates how quality metrics underpin valuation accuracy, customer trust, operational efficiency, and innovation capacity. The framework's interconnected "metrics layer" ensures that improvements in one dimension cascade into others, maximising strategic impact. This holistic approach bridges the gap between granular technical assessment and high-level decision-making, offering practitioners, data stewards, and strategists a scalable, evidence-based reference for aligning data quality management with sustainable value creation.
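
As a rough illustration of the "metrics layer" idea, the Python sketch below groups a few metrics into the four subclusters and tags them with the BSC perspectives they support. The metric names and metric-to-perspective assignments are invented; the chapter's taxonomy covers over one hundred metrics with its own mapping.

    # Toy metrics layer: each metric belongs to a subcluster and feeds one or
    # more Balanced Scorecard perspectives.
    METRICS = {
        "completeness": {"subcluster": "Fundamental", "perspectives": {"Internal Processes"}},
        "timeliness":   {"subcluster": "Contextual",  "perspectives": {"Customer", "Financial"}},
        "granularity":  {"subcluster": "Resolution",  "perspectives": {"Internal Processes"}},
        "lineage":      {"subcluster": "Specialised", "perspectives": {"Learning & Growth"}},
    }

    def metrics_for(perspective: str):
        """List metrics that feed a given Balanced Scorecard perspective."""
        return [m for m, info in METRICS.items() if perspective in info["perspectives"]]

    print(metrics_for("Internal Processes"))  # ['completeness', 'granularity']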
SQL
Domyn
Abstract
Text-to-SQL, the task of translating natural language questions into SQL queries, has long been a central challenge in NLP. While progress has been significant, applying it to the financial domain remains especially difficult due to complex schemas, domain-specific terminology, and the high stakes of error. Despite this, there is no dedicated large-scale financial dataset to advance research, creating a critical gap. To address this, we introduce a curated financial dataset (FINCH) comprising 292 tables and 75,725 natural language-SQL pairs, enabling both fine-tuning and rigorous evaluation. Building on this resource, we benchmark reasoning models and language models of varying scales, providing a systematic analysis of their strengths and limitations on financial Text-to-SQL tasks. Finally, we propose a finance-oriented evaluation metric (FINCH Score) that captures nuances overlooked by existing measures, offering a more faithful assessment of model performance.
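
The abstract does not define the FINCH Score, so the Python sketch below instead shows the standard execution-accuracy baseline for Text-to-SQL evaluation (run gold and predicted SQL, then compare result sets), the kind of measure the FINCH Score is positioned against. The schema and queries are invented.

    # Execution-accuracy baseline: two queries match if they return the same rows.
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
    CREATE TABLE trades (trade_id INT, ticker TEXT, notional REAL);
    INSERT INTO trades VALUES (1, 'AAPL', 1e6), (2, 'MSFT', 2e6), (3, 'AAPL', 5e5);
    """)

    def execution_match(gold_sql: str, pred_sql: str) -> bool:
        """True if both queries return the same multiset of rows."""
        run = lambda q: sorted(con.execute(q).fetchall())
        try:
            return run(gold_sql) == run(pred_sql)
        except sqlite3.Error:
            return False  # unexecutable predictions count as wrong

    gold = "SELECT SUM(notional) FROM trades WHERE ticker = 'AAPL'"
    pred = "SELECT SUM(notional) FROM trades WHERE ticker LIKE 'AAPL'"
    print(execution_match(gold, pred))  # True

One nuance such a baseline misses, and that a finance-oriented metric would want to weigh, is that two queries can return equal results on a toy instance while diverging on edge cases like NULL notionals.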

Interests not found

We did not find any papers matching the interests below. Try other terms, and consider whether the content exists on arxiv.org.
  • NoSQL Databases
  • Database Design
You can edit or add more interests any time.
