Abstract
Reward machines are an established tool for dealing with reinforcement
learning problems in which rewards are sparse and depend on complex sequences
of actions. However, existing algorithms for learning reward machines assume an
overly idealized setting where rewards have to be free of noise. To overcome
this practical limitation, we introduce a novel type of reward machines, called
stochastic reward machines, and an algorithm for learning them. Our algorithm,
based on constraint solving, learns minimal stochastic reward machines from the
explorations of a reinforcement learning agent. This algorithm can easily be
paired with existing reinforcement learning algorithms for reward machines and
is guaranteed to converge to an optimal policy in the limit. We demonstrate the
effectiveness of our algorithm in two case studies and show that it outperforms
both existing methods and a naive approach for handling noisy reward functions.
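To make the notion concrete, the sketch below models a stochastic reward machine as a finite automaton whose event-labeled transitions emit rewards drawn from a distribution (Gaussian here) rather than a fixed value. The class name, the Gaussian noise model, and the example events are illustrative assumptions, not the paper's exact formalization or code.

```python
import random

class StochasticRewardMachine:
    """Minimal illustrative sketch (hypothetical API, not the paper's code):
    finite states, event-labeled transitions, and a reward *distribution*
    (mean, std) per transition instead of a single deterministic reward."""

    def __init__(self, initial_state):
        self.initial_state = initial_state
        self.state = initial_state
        # (state, event) -> (next_state, reward_mean, reward_std)
        self.delta = {}

    def add_transition(self, state, event, next_state, mean, std):
        self.delta[(state, event)] = (next_state, mean, std)

    def step(self, event):
        """Advance on a high-level event and sample a noisy reward."""
        next_state, mu, sigma = self.delta.get(
            (self.state, event), (self.state, 0.0, 0.0))
        self.state = next_state
        return random.gauss(mu, sigma)

    def reset(self):
        self.state = self.initial_state

# Example: a noisy reward of roughly 1 is paid only after observing
# the event "key" followed by the event "door".
m = StochasticRewardMachine("u0")
m.add_transition("u0", "key", "u1", 0.0, 0.05)
m.add_transition("u1", "door", "u2", 1.0, 0.10)
print(m.step("key"), m.step("door"))
```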
Abstract
The curse of dimensionality renders Reinforcement Learning (RL) impractical
in many real-world settings with exponentially large state and action spaces.
Yet, many environments exhibit exploitable structure that can accelerate
learning. To formalize this idea, we study RL in Block Markov Decision
Processes (BMDPs). BMDPs model problems with large observation spaces, but
where transition dynamics are fully determined by latent states. Recent
advances in clustering methods have enabled the efficient recovery of this
latent structure. However, a regret analysis that exploits these techniques to
determine their impact on learning performance has remained open. We address
this gap by providing a regret analysis that explicitly leverages clustering,
demonstrating that accurate latent-state estimation can indeed speed up
learning.
Concretely, this paper analyzes a two-phase RL algorithm for BMDPs that first
learns the latent structure through random exploration and then switches to an
optimism-guided strategy adapted to the uncovered structure. This algorithm
achieves a regret of $O(\sqrt{T}+n)$ on a large class of BMDPs amenable
to clustering. Here, $T$ denotes the number of time steps, $n$ is the
cardinality of the observation space, and the Landau notation $O(\cdot)$ hides
constants and polylogarithmic factors. This improves on the best prior
bound, $O(\sqrt{T}+n^2)$, especially when $n$ is large. Moreover, we prove that
no algorithm can achieve lower regret uniformly on this same class of BMDPs.
This establishes that, on this class, the algorithm achieves asymptotic
optimality.
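As a concrete illustration of the two-phase scheme described above, the following sketch first explores uniformly at random, clusters the collected observations into estimated latent states, and then runs optimistic (UCB-style) value iteration on the estimated latent MDP. The environment interface, the clustering subroutine, the bonus term, and the phase lengths are assumptions made for illustration; the paper's exact procedure and constants may differ.

```python
import numpy as np

def two_phase_bmdp(env, cluster, n_latent, n_actions,
                   explore_steps, n_episodes, horizon):
    """Illustrative sketch (hypothetical interfaces):
    env     -- reset() -> obs, step(a) -> (obs, reward, done)
    cluster -- maps a list of observations to a decoder obs -> latent index,
               standing in for the paper's clustering subroutine."""
    # Phase 1: uniform random exploration to collect observations.
    observations, o = [], env.reset()
    for _ in range(explore_steps):
        o, _, done = env.step(np.random.randint(n_actions))
        observations.append(o)
        if done:
            o = env.reset()
    phi = cluster(observations, n_latent)  # estimated decoding obs -> latent state

    # Phase 2: optimism-guided learning on the estimated latent MDP.
    counts = np.ones((n_latent, n_actions, n_latent))  # smoothed transition counts
    rew = np.zeros((n_latent, n_actions))              # cumulative rewards
    for episode in range(n_episodes):
        p_hat = counts / counts.sum(axis=2, keepdims=True)
        r_hat = rew / counts.sum(axis=2)
        bonus = np.sqrt(np.log(episode + 2) / counts.sum(axis=2))  # UCB-style bonus
        # Finite-horizon optimistic value iteration on the latent model.
        q = np.zeros((horizon + 1, n_latent, n_actions))
        for h in range(horizon - 1, -1, -1):
            q[h] = r_hat + bonus + p_hat @ q[h + 1].max(axis=1)
        o = env.reset()
        for h in range(horizon):
            s = phi(o)
            a = int(np.argmax(q[h, s]))
            o2, r, done = env.step(a)
            counts[s, a, phi(o2)] += 1
            rew[s, a] += r
            if done:
                break
            o = o2
    return counts, rew
```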
AI Insights
- Sharp martingale-difference concentration bounds give high-probability control of value-function gaps.
- Random exploration is quantified by Azuma–Hoeffding inequalities (stated after this list), ensuring accurate latent-state clustering.
- The optimism-guided phase uses a regret decomposition that isolates the effect of clustering errors.
- A novel spectral-norm bound on random matrices arising from observation-to-latent maps underpins the analysis.
- The paper references Boucheron, Lugosi, and Massart's "Concentration Inequalities" and Tropp's matrix tail bounds.
- Definitions: a "concentration inequality" bounds the probability that a random quantity deviates far from its mean; a "martingale difference sequence" is the sequence of increments of a martingale.
- The results show clustering reduces regret from $O(\sqrt{T}+n^2)$ to the optimal $O(\sqrt{T}+n)$, a dramatic speed-up.
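For reference, the Azuma–Hoeffding inequality invoked in the list above is the standard concentration bound for martingales with bounded increments; the form below is the textbook statement, not a result specific to this paper. If $(X_t)_{t \ge 0}$ is a martingale with $|X_t - X_{t-1}| \le c_t$ almost surely, then for every $\varepsilon > 0$,
$$\Pr\bigl(|X_T - X_0| \ge \varepsilon\bigr) \;\le\; 2\exp\!\left(-\frac{\varepsilon^2}{2\sum_{t=1}^{T} c_t^2}\right).$$
Applied to the martingale differences arising during the random-exploration phase, bounds of this form give the high-probability control over latent-state clustering mentioned in the list.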