Beijing Institute of Technology
Abstract
A growing trend in modern data analysis is the integration of data management
with learning, guided by accuracy, latency, and cost requirements. In practice,
applications draw data in different formats from many sources, while objectives
and budgets change over time. Existing systems handle such applications across
separate databases, analysis libraries, and tuning services.
Such fragmentation leads to complex user interaction, limited adaptability,
suboptimal performance, and poor extensibility across components. To address
these challenges, we present Aixel, a unified, adaptive, and extensible system
for AI-powered data analysis. The system organizes work across four layers:
application, task, model, and data. The task layer provides a declarative
interface to capture user intent, which is parsed into an executable operator
plan. An optimizer compiles and schedules this plan to meet specified goals in
accuracy, latency, and cost. The task layer coordinates the execution of data
and model operators, with built-in support for reuse and caching to improve
efficiency. The model layer offers versioned storage for indexes, metadata,
tensors, and model artifacts. It supports adaptive construction, task-aligned
drift detection, and safe updates that reuse shared components. The data layer
provides unified data management capabilities, including indexing,
constraint-aware discovery, task-aligned selection, and comprehensive feature
management. Together, these layers make Aixel a user-friendly, adaptive,
efficient, and extensible system.
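To make the task layer concrete, here is a minimal Python sketch of how a declarative task with accuracy, latency, and cost goals might be compiled into an operator plan. All names (TaskSpec, Goal, Operator, choose_plan) and the cost model are illustrative assumptions, not Aixel's actual API.

```python
# Hypothetical sketch of a declarative task layer: user intent plus goals,
# compiled by picking the cheapest operator plan that meets all goals.
# Every name and estimate here is an assumption for illustration.
from dataclasses import dataclass, field


@dataclass
class Goal:
    """User-specified objectives the optimizer must satisfy."""
    min_accuracy: float = 0.0
    max_latency_ms: float = float("inf")
    max_cost_usd: float = float("inf")


@dataclass
class Operator:
    """A single data or model operator in the executable plan."""
    name: str
    kind: str                  # "data" or "model"
    est_latency_ms: float
    est_cost_usd: float
    est_accuracy: float = 1.0  # data operators assumed accuracy-neutral


@dataclass
class TaskSpec:
    """Declarative intent: what to compute, under which goals."""
    query: str
    goal: Goal
    candidate_plans: list = field(default_factory=list)


def choose_plan(task: TaskSpec) -> list:
    """Pick the cheapest candidate plan that meets all stated goals."""
    feasible = []
    for plan in task.candidate_plans:
        latency = sum(op.est_latency_ms for op in plan)
        cost = sum(op.est_cost_usd for op in plan)
        accuracy = min(op.est_accuracy for op in plan)
        if (accuracy >= task.goal.min_accuracy
                and latency <= task.goal.max_latency_ms
                and cost <= task.goal.max_cost_usd):
            feasible.append((cost, plan))
    if not feasible:
        raise ValueError("no plan satisfies the stated goals")
    return min(feasible, key=lambda pair: pair[0])[1]


# Two plans for the same intent: a cheap small model vs. a costly large one.
# The optimizer picks the cheapest plan that meets the accuracy goal.
task = TaskSpec(
    query="classify support tickets by urgency",
    goal=Goal(min_accuracy=0.85, max_latency_ms=500),
    candidate_plans=[
        [Operator("scan", "data", 20, 0.001),
         Operator("small_model", "model", 80, 0.002, est_accuracy=0.88)],
        [Operator("scan", "data", 20, 0.001),
         Operator("large_model", "model", 300, 0.02, est_accuracy=0.95)],
    ],
)
print([op.name for op in choose_plan(task)])  # ['scan', 'small_model']
```

The point the abstract makes is captured here: the same declarative intent can compile to different plans as goals or budgets change, without the user rewriting anything.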
Humboldt-Universität zu Berlin
Abstract
Over the past decade, the proliferation of public and enterprise data lakes
has fueled intensive research into data discovery, aiming to identify the most
relevant data from vast and complex corpora to support diverse user tasks.
Significant progress has been made through the development of innovative index
structures, similarity measures, and querying infrastructures. Despite these
advances, a critical aspect remains overlooked: relevance is time-varying.
Existing discovery methods largely ignore this temporal dimension, especially
when explicit date/time metadata is missing. To fill this gap, we outline a
vision for a data discovery system that incorporates the temporal dimension of
data. Specifically, we define the problem of temporally-valid data discovery
and argue that addressing it requires techniques for version discovery,
temporal lineage inference, change log synthesis, and time-aware data
discovery. We then present a system architecture that delivers these
techniques and summarize the remaining research challenges and opportunities.
In doing so, we lay the foundation for a new class of data discovery systems,
transforming how we interact with evolving data lakes.
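As a concrete illustration of the problem statement, the following Python sketch shows what a temporally-valid discovery query could look like: candidates are ranked by content relevance, but only among versions whose (possibly inferred) validity interval overlaps the query's time window. The data model and scoring below are assumptions for illustration, not the proposed system's interface.

```python
# Minimal sketch of temporally-valid data discovery: filter candidate
# dataset versions by validity-interval overlap with the query window,
# then rank by content relevance. All fields are illustrative assumptions.
from dataclasses import dataclass
from datetime import date


@dataclass
class DatasetVersion:
    name: str
    valid_from: date   # inferred when no explicit timestamp exists
    valid_to: date
    relevance: float   # content-based similarity to the query


def temporally_valid_discovery(candidates, window_start, window_end, k=3):
    """Return the top-k relevant versions valid inside the query window."""
    valid = [d for d in candidates
             if d.valid_from <= window_end and d.valid_to >= window_start]
    return sorted(valid, key=lambda d: d.relevance, reverse=True)[:k]


versions = [
    DatasetVersion("taxi_2019", date(2019, 1, 1), date(2019, 12, 31), 0.91),
    DatasetVersion("taxi_2021", date(2021, 1, 1), date(2021, 12, 31), 0.93),
    DatasetVersion("weather_2021", date(2021, 1, 1), date(2021, 12, 31), 0.40),
]
# A purely content-based search would return taxi_2019 and taxi_2021 alike;
# restricting to a 2021 window keeps only the temporally valid version.
hits = temporally_valid_discovery(versions, date(2021, 1, 1), date(2021, 6, 30))
print([d.name for d in hits])  # ['taxi_2021', 'weather_2021']
```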
AI Insights
- Blend's hybrid cache mixes in-memory storage with on-demand time travel, cutting latency for large-scale discovery.
- A classifier separates unrelated datasets from temporally linked versions, enabling precise lineage construction.
- Heuristics extract change logs from version histories, giving users a navigable timeline of data evolution; a toy sketch of this idea follows the list.
- Content-based and version-specific queries are decoupled, letting the system infer query intent from minimal input.
- The architecture infers temporal context even without explicit timestamps, using schema and metadata cues.
- Recommended: "Delta Lake: High-performance ACID table storage over cloud object stores" for robust versioning foundations.
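The change-log-synthesis heuristic mentioned in the list above can be illustrated with a toy diff over two dataset versions. Everything below, including the key-based matching and the log format, is an assumption about how such a heuristic might work, not the paper's actual method.

```python
# Toy change-log synthesis: diff two snapshots of a table on a key column
# and emit added/removed/updated entries. Illustrative assumption only.
def synthesize_change_log(old_rows, new_rows, key="id"):
    """Derive a human-readable change log from two dataset versions."""
    old = {r[key]: r for r in old_rows}
    new = {r[key]: r for r in new_rows}
    log = []
    for k in new.keys() - old.keys():
        log.append(f"added row {k}")
    for k in old.keys() - new.keys():
        log.append(f"removed row {k}")
    for k in old.keys() & new.keys():
        changed = {c for c in new[k] if new[k][c] != old[k].get(c)}
        if changed:
            log.append(f"updated row {k}: columns {sorted(changed)}")
    return sorted(log)


v1 = [{"id": 1, "city": "Berlin", "pop": 3_600_000},
      {"id": 2, "city": "Beijing", "pop": 21_500_000}]
v2 = [{"id": 1, "city": "Berlin", "pop": 3_700_000},
      {"id": 3, "city": "Munich", "pop": 1_500_000}]
print(synthesize_change_log(v1, v2))
# ['added row 3', 'removed row 2', "updated row 1: columns ['pop']"]
```

A synthesized log like this gives users the navigable timeline the insight describes, even when the versions themselves shipped without any change documentation.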