Abstract
The emergence of Superchips represents a significant advancement in
next-generation AI hardware. These Superchips employ a tightly coupled
heterogeneous architecture that integrates the GPU and CPU on the same
package, offering unprecedented computational power. However, there has been
scant research investigating how LLM training can benefit from this new
architecture. In this work, we present the first study of offloading-based
LLM training solutions for Superchips. We observe important differences
between Superchips and the traditional loosely coupled GPU-CPU architecture,
which necessitate revisiting prevailing assumptions about offloading. Based
on these observations, we present SuperOffload, a Superchip-centric
offloading system that utilizes the Hopper GPU, the Grace CPU, and the
NVLink-C2C interconnect simultaneously and more efficiently.
SuperOffload accomplishes this via a combination of techniques, such as
adaptive weight offloading, bucketization repartitioning, Superchip-aware
casting, speculative execution, and a highly optimized Adam optimizer for Grace
CPUs. Our evaluation of SuperOffload on the NVIDIA GH200 demonstrates up to
2.5x higher throughput than state-of-the-art offloading-based systems,
enabling training of models with up to 25B parameters on a single Superchip
while achieving high training throughput. We also extend SuperOffload with
ZeRO-style data parallelism and DeepSpeed-Ulysses sequence parallelism,
enabling training of a 13B model with sequence lengths of up to 1 million
tokens on 8 GH200 Superchips while achieving 55% model FLOPs utilization
(MFU).
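The division of labor the abstract describes, with parameters and gradients on the GPU while the optimizer states and the Adam step live on the Grace CPU, can be illustrated with a toy numpy model. This is a hedged sketch only: the function names (`cpu_adam_step`, `offloaded_step`) are hypothetical and are not SuperOffload's API, and the GPU-to-CPU transfers over NVLink-C2C are simulated here by plain array copies.

```python
import numpy as np

def cpu_adam_step(param, grad, m, v, t, lr=0.1, b1=0.9, b2=0.999, eps=1e-8):
    # Standard Adam update math; in an offloading system this runs on the
    # CPU, where the optimizer states (m, v) permanently reside.
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def offloaded_step(gpu_param, gpu_grad, cpu_state):
    # "Offload": a real system would move gradients GPU -> CPU over the
    # interconnect; here both sides are just numpy arrays and we copy.
    cpu_grad = np.asarray(gpu_grad).copy()
    cpu_state["t"] += 1
    new_param, cpu_state["m"], cpu_state["v"] = cpu_adam_step(
        gpu_param, cpu_grad, cpu_state["m"], cpu_state["v"], cpu_state["t"])
    # Updated weights are copied back to the "GPU" side.
    return new_param

# Toy training loop: minimize f(x) = x^2, whose gradient is 2x.
param = np.array([1.0])
state = {"m": np.zeros_like(param), "v": np.zeros_like(param), "t": 0}
for _ in range(200):
    grad = 2.0 * param
    param = offloaded_step(param, grad, state)
```

The point of the sketch is the data placement, not the arithmetic: only gradients cross the (simulated) interconnect each step, while the large optimizer states never leave CPU memory.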
Abstract
We consider a spectrum sharing problem where two users attempt to communicate
over N channels. The Primary User (PU) has prioritized transmissions and its
occupancy on each channel over time can be modeled as a Markov chain. The
Secondary User (SU) must determine which channels are free in each
time-slot and attempt opportunistic transmissions. The goal of the SU is to
maximize its own throughput while simultaneously minimizing collisions with
the PU and satisfying spectrum access constraints. To solve this problem, we
first decouple the multiple-channel problem into N single-channel problems. For
each decoupled problem, we prove that there exists an optimal threshold policy
that depends on the last observed PU occupancy and the freshness of this
occupancy information. Second, we establish the indexability of the decoupled
problems by analyzing the structure of the optimal threshold policy. Using
this structure, we derive a Whittle index-based scheduling policy that
allocates SU transmissions based on the Age of Information (AoI) of the
accessed channels. We also
extend our insights to PU occupancy models that are correlated across channels
and incorporate learning of unknown Markov transition matrices into our
policies. Finally, we provide detailed numerical simulations that demonstrate
the performance gains of our approach.
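The scheduling idea the abstract outlines, ranking channels by an index that depends on the last observed PU occupancy and its age, can be sketched as follows. This is a hedged illustration only: the true Whittle indices come from the paper's decoupled threshold policies, whereas here we use a simple stand-in index (the predicted probability that the channel is free) computed from a two-state Markov chain; all names and parameters are hypothetical.

```python
import numpy as np

def belief_free(p_stay_free, p_become_free, last_obs_free, aoi):
    """Predict P(channel free now) from a 2-state PU occupancy Markov chain,
    given the last observation and its age (AoI = slots since observed)."""
    # Rows: from-free, from-busy; columns: to-free, to-busy.
    P = np.array([[p_stay_free, 1.0 - p_stay_free],
                  [p_become_free, 1.0 - p_become_free]])
    b = np.array([1.0, 0.0]) if last_obs_free else np.array([0.0, 1.0])
    for _ in range(aoi):
        b = b @ P
    return b[0]

def schedule(channels):
    """Pick the channel with the highest stand-in index.
    A real Whittle-index policy would rank channels by their computed
    indices; the belief of being free is used here as a hypothetical proxy."""
    idx = [belief_free(c["p_ff"], c["p_bf"], c["obs_free"], c["aoi"])
           for c in channels]
    return int(np.argmax(idx))

# Example: channel 0 was just seen free; channel 1 was seen busy 5 slots ago.
channels = [
    {"p_ff": 0.9, "p_bf": 0.2, "obs_free": True,  "aoi": 0},
    {"p_ff": 0.9, "p_bf": 0.2, "obs_free": False, "aoi": 5},
]
chosen = schedule(channels)
```

Note how the AoI enters naturally: the older the observation, the more the belief relaxes toward the chain's stationary distribution, which is exactly why the freshness of occupancy information appears in the optimal threshold policy.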