Abstract
The rapid advancement of artificial intelligence has positioned data
governance as a critical concern for responsible AI development. While
frameworks exist for conventional AI systems, the potential emergence of
Artificial General Intelligence (AGI) presents unprecedented governance
challenges. This paper examines data governance challenges specific to AGI,
defined as systems capable of recursive self-improvement or self-replication.
We identify seven key issues that differentiate AGI governance from current
approaches. First, AGI may autonomously determine what data to collect and how
to use it, potentially circumventing existing consent mechanisms. Second, these
systems may make data retention decisions based on internal optimization
criteria rather than human-established principles. Third, AGI-to-AGI data
sharing could occur at speeds and complexities beyond human oversight. Fourth,
recursive self-improvement creates unique provenance tracking challenges, as
systems modify both themselves and the ways they process data. Fifth, ownership of
data and insights generated through self-improvement raises complex
intellectual property questions. Sixth, self-replicating AGI distributed across
jurisdictions would create unprecedented challenges for enforcing data
protection laws. Finally, governance frameworks established during early AGI
development may quickly become obsolete as systems evolve. We conclude that
effective AGI data governance requires built-in constraints, continuous
monitoring mechanisms, dynamic governance structures, international
coordination, and multi-stakeholder involvement. Without forward-looking
governance approaches specifically designed for systems with autonomous data
capabilities, we risk creating AGI whose relationship with data evolves in ways
that undermine human values and interests.
Abstract
In recent years, Large Language Models (LLMs) have emerged as transformative
tools across numerous domains, impacting how professionals approach complex
analytical tasks. This systematic mapping study comprehensively examines the
application of LLMs throughout the Data Science lifecycle. By analyzing
relevant papers from Scopus and IEEE databases, we identify and categorize the
types of LLMs being applied, the specific stages and tasks of the data science
process they address, and the methodological approaches used for their
evaluation. Our analysis includes a detailed examination of evaluation metrics
employed across studies and systematically documents both positive
contributions and limitations of LLMs when applied to data science workflows.
This mapping provides researchers and practitioners with a structured
understanding of the current landscape, highlighting trends, gaps, and
opportunities for future research in this rapidly evolving intersection of LLMs
and data science.