University of Pennsylvania
Abstract
Mainstream news organizations shape public perception not only directly
through the articles they publish but also through the choices they make about
which topics to cover (or ignore) and how to frame the issues they do decide to
cover. However, measuring these subtle forms of media bias at scale remains a
challenge. Here, we introduce a large, ongoing (from January 1, 2024 to
present), near-real-time dataset and computational framework developed to
enable systematic study of selection and framing bias in news coverage. Our
pipeline integrates large language models (LLMs) with scalable, near-real-time
news scraping to extract structured annotations -- including political lean,
tone, topics, article type, and major events -- across hundreds of articles per
day. We quantify these dimensions of coverage at multiple levels -- the
sentence level, the article level, and the publisher level -- expanding the
ways in which researchers can analyze media bias in the modern news landscape.
In addition to the curated dataset, we release an interactive web platform
for convenient exploration of these data. Together, these contributions
establish a reusable methodology for studying media bias at scale, providing
empirical resources for future research. Leveraging the breadth of the corpus
over time and across publishers, we also present example analyses (focused on
the 150,000+ articles examined in 2024) that illustrate how this novel dataset
can reveal patterns in news coverage and bias, supporting academic research
and real-world efforts to improve media accountability.
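
To make the structured annotations concrete, the sketch below shows one
plausible shape for a per-article record covering the dimensions named above
(political lean, tone, topics, article type, and major events). The field
names, types, and score ranges are illustrative assumptions, not the released
schema.

# Hypothetical per-article annotation record; names and ranges are
# illustrative, not the dataset's actual schema.
from dataclasses import dataclass, field
from typing import List

@dataclass
class ArticleAnnotation:
    publisher: str                     # outlet that published the article
    published_at: str                  # ISO-8601 publication timestamp
    article_type: str                  # e.g. "news", "opinion", "analysis"
    political_lean: float              # e.g. -1.0 (left) to +1.0 (right)
    tone: float                        # e.g. -1.0 (negative) to +1.0 (positive)
    topics: List[str] = field(default_factory=list)        # topic labels
    major_events: List[str] = field(default_factory=list)  # linked major events

Sentence-level annotations could nest within such a record, and
publisher-level measures could aggregate these fields over articles and days.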
Abstract
Neural networks have revolutionized numerous fields, yet they remain
vulnerable to a critical flaw: the tendency to learn implicit biases, i.e.,
spurious correlations between certain attributes and target labels in the
training data.
These biases are often more prevalent and easier to learn, causing models to
rely on superficial patterns rather than task-relevant features necessary for
generalization. Existing methods typically rely on strong assumptions, such as
prior knowledge of these biases or access to bias-conflicting samples, i.e.,
samples that contradict the spurious correlations and counterbalance the
bias-aligned samples that conform to them. However, such
assumptions are often impractical in real-world settings. We propose BLADE
(Bias-Linked Adaptive DEbiasing), a generative debiasing framework that
requires no prior knowledge of bias or bias-conflicting samples. BLADE first
trains a generative model to translate images across bias domains while
preserving task-relevant features. Then, it adaptively refines each image with
its synthetic counterpart based on the image's susceptibility to bias. To
encourage robust representations, BLADE aligns an image with its
bias-translated synthetic counterpart that shares task-relevant features but
differs in bias, while misaligning it with samples sharing the same bias. We
evaluate BLADE on multiple benchmark datasets and show that it significantly
outperforms state-of-the-art methods. Notably, it exceeds the closest baseline
by an absolute margin of around 18% on the corrupted CIFAR-10 dataset under the
worst-group setting, establishing a new benchmark in bias mitigation and
demonstrating its potential for developing more robust deep learning models
without explicit supervision.
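
As a rough illustration of the alignment step described above, the following
PyTorch-style sketch pulls an image's embedding toward its bias-translated
synthetic counterpart while pushing it away from embeddings of samples that
share the same bias. The InfoNCE-style formulation, function name, and
temperature are assumptions made for illustration, not the paper's exact
objective.

# Illustrative contrastive alignment: the bias-translated counterpart is the
# positive, samples sharing the anchor's bias are negatives. Not the paper's
# exact loss.
import torch
import torch.nn.functional as F

def alignment_loss(anchor, positive, same_bias_negatives, temperature=0.1):
    """anchor: (D,) image embedding; positive: (D,) embedding of its
    bias-translated counterpart; same_bias_negatives: (N, D) embeddings of
    samples sharing the anchor's bias."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(same_bias_negatives, dim=-1)

    pos_sim = (anchor * positive).sum() / temperature    # similarity to the counterpart
    neg_sim = negatives @ anchor / temperature           # similarities to same-bias samples

    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim])  # positive logit at index 0
    target = torch.zeros(1, dtype=torch.long)            # cross-entropy target: index 0
    return F.cross_entropy(logits.unsqueeze(0), target)

Minimizing this loss draws representations together across bias domains while
separating representations that share only the bias attribute.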