Surf AI, Cybertino Lab
Abstract
We present CAIA, a benchmark exposing a critical blind spot in AI evaluation:
the inability of state-of-the-art models to operate in adversarial, high-stakes
environments where misinformation is weaponized and errors are irreversible.
While existing benchmarks measure task completion in controlled settings,
real-world deployment demands resilience against active deception. Using crypto
markets, where $30 billion was lost to exploits in 2024, as a testbed, we
evaluate 17 models on 178 time-anchored tasks requiring agents to distinguish
truth from manipulation, navigate fragmented information landscapes, and make
irreversible financial decisions under adversarial pressure.
Our results reveal a fundamental capability gap: without tools, even frontier
models achieve only 28% accuracy on tasks junior analysts routinely handle.
Tool augmentation improves performance but plateaus at 67.4%, versus an 80%
human baseline, despite unlimited access to professional resources. Most critically,
we uncover a systematic tool selection catastrophe: models preferentially
choose unreliable web search over authoritative data, falling for SEO-optimized
misinformation and social media manipulation. This behavior persists even when
correct answers are directly accessible through specialized tools, suggesting
foundational limitations rather than knowledge gaps. We also find that Pass@k
metrics mask trial-and-error behavior that would be dangerous in autonomous deployment.
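For reference, a standard definition treats a task as solved under Pass@k if any of k sampled attempts succeeds; with n attempts drawn per task, of which c succeed, the usual unbiased estimator (assumed here only for illustration; CAIA's exact estimator may differ) is

\[
\text{Pass@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[\, 1 - \frac{\binom{n-c}{k}}{\binom{n}{k}} \,\right].
\]

Because one success among k attempts suffices, the metric implicitly rewards repeated retries, precisely the behavior that cannot be tolerated when each financial action is irreversible.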
The implications extend beyond crypto to any domain with active adversaries,
such as cybersecurity and content moderation. We release CAIA with contamination
controls and continuous updates, establishing adversarial robustness as a
necessary condition for trustworthy AI autonomy. The benchmark reveals that
current models, despite impressive reasoning scores, remain fundamentally
unprepared for environments where intelligence must survive active opposition.
Fakultät für Mathematik
Abstract
Black-box optimization (BBO) addresses problems where objectives are
accessible only through costly queries without gradients or explicit structure.
Classical derivative-free methods -- line search, direct search, and
model-based solvers such as Bayesian optimization -- form the backbone of BBO,
yet often struggle in high-dimensional, noisy, or mixed-integer settings.
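As a minimal sketch of this query-only setting (a generic compass-style direct search; the objective, step schedule, and stopping rule below are illustrative placeholders, not a solver from the survey), the optimizer probes coordinate directions using only function evaluations and shrinks its step when no probe improves the incumbent:

```python
import numpy as np

def direct_search(f, x0, step=1.0, tol=1e-6, max_evals=1000):
    """Coordinate (compass) direct search using only f(x) queries, no gradients."""
    x = np.asarray(x0, dtype=float)
    fx, evals = f(x), 1
    while step > tol and evals < max_evals:
        improved = False
        for i in range(len(x)):
            for sign in (+1.0, -1.0):
                cand = x.copy()
                cand[i] += sign * step
                fc = f(cand)
                evals += 1
                if fc < fx:                 # accept the first improving probe
                    x, fx, improved = cand, fc, True
                    break
            if improved:
                break
        if not improved:                    # no direction improved: halve the step
            step *= 0.5
    return x, fx

# Placeholder black-box objective: a smooth quadratic queried only pointwise.
x_best, f_best = direct_search(lambda z: float(np.sum((z - 1.0) ** 2)), x0=np.zeros(3))
```

Such solvers rely only on point evaluations, which is both the source of their generality and of the scaling difficulties noted above.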
Recent advances use machine learning (ML) and reinforcement learning (RL) to
enhance BBO: ML provides expressive surrogates, adaptive updates, meta-learning
portfolios, and generative models, while RL enables dynamic operator
configuration, robustness, and meta-optimization across tasks.
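To make the surrogate idea concrete, here is a hedged toy sketch (not any specific method covered in the survey; the quadratic-feature surrogate and random candidate pool are arbitrary illustrative choices): a cheap model is fit to past queries and used to decide where to spend the next expensive evaluation.

```python
import numpy as np

def surrogate_step(f, X, y, rng, n_candidates=256):
    """One iteration of a toy surrogate-assisted loop.
    X: (n, d) evaluated points; y: (n,) observed objective values.
    Fits a least-squares quadratic surrogate, then queries the true
    objective only at the candidate the surrogate predicts is best."""
    def feats(P):                                   # features [1, x, x**2]
        return np.hstack([np.ones((len(P), 1)), P, P ** 2])
    w, *_ = np.linalg.lstsq(feats(X), y, rcond=None)
    cands = rng.uniform(X.min() - 1.0, X.max() + 1.0,
                        size=(n_candidates, X.shape[1]))
    x_next = cands[np.argmin(feats(cands) @ w)]     # cheap surrogate minimization
    return x_next, f(x_next)                        # single costly true query

rng = np.random.default_rng(0)
f = lambda x: float(np.sum((x - 2.0) ** 2))         # placeholder black box
X = rng.uniform(-5.0, 5.0, size=(8, 2))
y = np.array([f(x) for x in X])
for _ in range(10):                                 # a few surrogate-guided queries
    x_new, y_new = surrogate_step(f, X, y, rng)
    X, y = np.vstack([X, x_new]), np.append(y, y_new)
```

In the surveyed methods, the least-squares model is replaced by richer surrogates (e.g., Gaussian processes or neural networks) and the candidate-selection step by learned or adaptive policies.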
This paper surveys these developments, covering representative algorithms
such as neural networks (NNs) with the modular model-based optimization framework (mlrMBO),
zeroth-order adaptive momentum methods (ZO-AdaMM), automated BBO (ABBO),
distributed block-wise optimization (DiBB), partition-based Bayesian
optimization (SPBOpt), the transformer-based optimizer (B2Opt),
diffusion-model-based BBO, surrogate-assisted RL for differential evolution
(Surr-RLDE), robust BBO (RBO), coordinate-ascent model-based optimization with
relative entropy (CAS-MORE), log-barrier stochastic gradient descent (LB-SGD),
policy improvement with black-box optimization (PIBB), and offline Q-learning with Mamba
backbones (Q-Mamba).
We also review benchmark efforts such as the NeurIPS 2020 BBO Challenge and
the MetaBox framework. Overall, we highlight how ML and RL transform classical
inexact solvers into more scalable, robust, and adaptive frameworks for
real-world optimization.