Abstract
Traditional named entity recognition (NER) aims to identify text mentions and
classify them into pre-defined entity types. Continual Named Entity Recognition
(CNER) has been introduced because entity categories keep growing in various
real-world scenarios. However, existing continual learning (CL) methods for NER
face two challenges: catastrophic forgetting and semantic shift of the
non-entity type. In this paper, we propose GenCNER, a simple but effective
Generative framework for CNER that mitigates these drawbacks. Specifically, we
convert the CNER task into a sustained entity triplet sequence generation
problem and use a pre-trained seq2seq model to solve it. Additionally, we
design a type-specific confidence-based pseudo labeling strategy along with
knowledge distillation (KD) to preserve learned knowledge and alleviate the
impact of label noise at the triplet level. Experimental results on two
benchmark datasets show that our framework outperforms previous
state-of-the-art methods in multiple CNER settings, and achieves the smallest
gap compared with non-CL results.
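
The type-specific, confidence-based pseudo labeling idea can be illustrated with a minimal sketch. The abstract does not spell out the triplet format or how thresholds are chosen, so everything here is an assumption made for illustration: a triplet is taken to be (start index, end index, entity type), the per-type threshold is taken to be the median confidence of the old model's predictions for that type, and the names `Triplet`, `type_thresholds`, and `pseudo_label` are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Triplet:
    start: int   # index of the first token of the mention (assumed format)
    end: int     # index of the last token, inclusive (assumed format)
    etype: str   # entity type label


def type_thresholds(predictions):
    """Compute one threshold per entity type.

    `predictions` is a list of (Triplet, confidence) pairs produced by the
    old model. Using the median per type is an assumption of this sketch.
    """
    by_type = defaultdict(list)
    for triplet, conf in predictions:
        by_type[triplet.etype].append(conf)
    return {etype: median(confs) for etype, confs in by_type.items()}


def pseudo_label(gold_new, old_predictions, thresholds):
    """Merge gold triplets of new types with confident old-type predictions.

    An old-model triplet is kept only if its confidence reaches the
    threshold of its own type; types without a threshold are never kept.
    """
    kept = [
        triplet
        for triplet, conf in old_predictions
        if conf >= thresholds.get(triplet.etype, float("inf"))
    ]
    merged = set(gold_new) | set(kept)
    # Sort by position so the target triplet sequence is deterministic.
    return sorted(merged, key=lambda t: (t.start, t.end, t.etype))
```

Filtering per type rather than with one global cut-off keeps a type on which the old model is systematically less confident from being dropped entirely, which is one plausible reading of "type-specific" here.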
Humboldt-Universität zu
Abstract
Over the past decade, the proliferation of public and enterprise data lakes
has fueled intensive research into data discovery, aiming to identify the most
relevant data from vast and complex corpora to support diverse user tasks.
Significant progress has been made through the development of innovative index
structures, similarity measures, and querying infrastructures. Despite these
advances, a critical aspect remains overlooked: relevance is time-varying.
Existing discovery methods largely ignore this temporal dimension, especially
when explicit date/time metadata is missing. To fill this gap, we outline a
vision for a data discovery system that incorporates the temporal dimension of
data. Specifically, we define the problem of temporally-valid data discovery
and argue that addressing it requires techniques for version discovery,
temporal lineage inference, change log synthesis, and time-aware data
discovery. We then present a system architecture to deliver these techniques,
before we summarize research challenges and opportunities. As such, we lay the
foundation for a new class of data discovery systems, transforming how we
interact with evolving data lakes.
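
To make the notion of temporal validity concrete, here is a minimal sketch of a point-in-time lookup over a chain of dataset versions. It is not the system described above: it assumes that version discovery and temporal lineage inference have already produced an ordered chain with an estimated start date per version, and that each version stays valid until the next one begins. The names `Version` and `version_valid_at` are hypothetical.

```python
from bisect import bisect_right
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Version:
    table: str
    valid_from: date  # assumed to be inferred, not read from metadata


def version_valid_at(chain, when):
    """Return the version of `chain` that is valid on `when`.

    `chain` must be sorted by `valid_from`. Returns None if `when` lies
    before the first known version.
    """
    starts = [v.valid_from for v in chain]
    # bisect_right counts the versions that started on or before `when`.
    i = bisect_right(starts, when)
    return chain[i - 1] if i else None


chain = [
    Version("sales_v1", date(2021, 1, 1)),
    Version("sales_v2", date(2022, 6, 1)),
    Version("sales_v3", date(2023, 3, 15)),
]
match = version_valid_at(chain, date(2022, 12, 31))
```

With the example chain, a query for the end of 2022 resolves to `sales_v2`, while a query dated before 2021 yields no valid version.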
AI Insights
- Blend’s hybrid cache mixes in‑memory storage with on‑demand time travel, cutting latency for large‑scale discovery.
- A classifier separates unrelated datasets from temporally linked versions, enabling precise lineage construction.
- Heuristics extract change logs from version histories, giving users a navigable timeline of data evolution.
- Content‑based and version‑specific queries are decoupled, letting the system infer query intent from minimal input.
- The architecture infers temporal context even without explicit timestamps, using schema and metadata cues.
- Recommended: “Delta Lake: High‑performance ACID table storage over cloud object stores” for robust versioning foundations.