University of Göttingen
Abstract
Multi-agent debate (MAD) has demonstrated the ability to augment collective
intelligence by scaling test-time compute and leveraging expertise. Current
frameworks for multi-agent debate are often geared towards tool use, lack
integrated evaluation, or provide limited configurability of agent personas,
response generators, discussion paradigms, and decision protocols. We introduce
MALLM (Multi-Agent Large Language Models), an open-source framework that
enables systematic analysis of MAD components. MALLM offers more than 144
unique configurations of MAD, including (1) agent personas (e.g., Expert,
Personality), (2) response generators (e.g., Critical, Reasoning), (3)
discussion paradigms (e.g., Memory, Relay), and (4) decision protocols (e.g.,
Voting, Consensus). MALLM uses simple configuration files to define a debate.
Furthermore, MALLM can load any textual Huggingface dataset (e.g., MMLU-Pro,
WinoGrande) and provides an evaluation pipeline for easy comparison of MAD
configurations. MALLM is tailored towards researchers and provides a window
into the heart of multi-agent debate, facilitating the understanding of its
components and their interplay.
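As a concrete illustration of the configuration-driven setup described in the abstract, here is a minimal sketch of what a MALLM-style debate config could contain, written as a Python dict. The keys and values are assumptions for illustration that mirror the four component families named above; they are not the framework's actual schema (see the repository for the real config format).

```python
# Hypothetical sketch of a MALLM-style debate configuration, written as a
# Python dict. The keys mirror the four component families named in the
# abstract; they are illustrative, not the framework's actual schema.
debate_config = {
    "dataset": "MMLU-Pro",  # any textual Huggingface dataset (e.g., WinoGrande)
    "agents": [
        {"persona": "Expert", "response_generator": "Reasoning"},
        {"persona": "Personality", "response_generator": "Critical"},
    ],
    "discussion_paradigm": "Memory",   # e.g., Memory or Relay
    "decision_protocol": "Voting",     # e.g., Voting or Consensus
}
```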
AI Insights
- MALLM's open-source repo hosts more than 144 distinct debate setups, letting researchers swap agent personas, response generators, and discussion paradigms with a single config file.
- The evaluation pipeline automatically benchmarks any Huggingface text dataset, such as MMLU-Pro or WinoGrande, across all chosen decision protocols.
- Config files expose fine-grained knobs (repeats, max turns, concurrent API requests, and sample size), enabling reproducible, large-scale experiments.
- "Stay Focused: Problem Drift in Multi-Agent Debate" offers a deep dive into the dynamic debate contexts that MALLM can simulate.
- "Voting or Consensus? Decision-Making in Multi-Agent Debate" compares protocol efficacy and is a must-read for protocol designers.
- MALLM's modular design lets you plug in new agent personas or response generators without touching the core codebase (see the sketch after this list).
- High compute demands and a steep learning curve for the config syntax are the main practical hurdles to keep in mind.
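To make the modularity claim above concrete, the sketch below shows one way a custom persona could be plugged in. The Persona base class and the registry are assumptions made for this illustration; they are not MALLM's actual extension API.

```python
# Hypothetical plug-in sketch: a custom persona registered alongside built-ins.
# The base class and registry are assumed for illustration only and are not
# MALLM's actual extension API.
class Persona:
    """Minimal interface a debate persona might expose."""
    name = "Base"

    def system_prompt(self) -> str:
        raise NotImplementedError


class DomainExpert(Persona):
    """Example custom persona parameterized by a field of expertise."""
    name = "DomainExpert"

    def __init__(self, field: str):
        self.field = field

    def system_prompt(self) -> str:
        return f"You are an expert in {self.field}. Argue from evidence and cite it."


# A registry like this would let config files refer to personas by name.
PERSONA_REGISTRY = {"Expert": DomainExpert}
```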
Tsinghua University
Abstract
Large language models (LLMs), a recent advance in deep learning and machine
intelligence, have demonstrated astonishing capacities and are now considered
among the most promising candidates for artificial general intelligence. With human-like
capabilities, LLMs have been used to simulate humans and serve as AI assistants
across many applications. As a result, great concern has arisen about whether
and under what circumstances LLMs think and behave like real human agents.
Rationality is among the most important concepts in assessing human behavior,
both in thinking (i.e., theoretical rationality) and in taking action (i.e.,
practical rationality). In this work, we propose the first benchmark for
evaluating the omnibus rationality of LLMs, covering a wide range of domains
and LLMs. The benchmark includes an easy-to-use toolkit, extensive experimental
results, and analysis that illuminates where LLMs converge with and diverge
from idealized human rationality. We believe the benchmark can serve as a
foundational tool for both developers and users of LLMs.
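To give a flavor of what an automated rationality check could look like, the snippet below tests whether pairwise choices elicited from a model form a transitive preference order, a classic requirement of practical rationality. The function and data are illustrative and are not taken from the paper's toolkit.

```python
# Toy automated check: do elicited pairwise preferences contain a cycle
# (a > b, b > c, c > a)? A cycle violates transitivity, a basic requirement
# of practically rational choice. Names and data are illustrative only.
from itertools import permutations

def is_transitive(prefers: dict[tuple[str, str], bool], items: list[str]) -> bool:
    """Return True if no preference cycle exists among the given items."""
    for a, b, c in permutations(items, 3):
        if prefers.get((a, b)) and prefers.get((b, c)) and prefers.get((c, a)):
            return False
    return True

# Example: choices elicited from an LLM over three lotteries form a cycle.
choices = {("A", "B"): True, ("B", "C"): True, ("C", "A"): True}
print(is_transitive(choices, ["A", "B", "C"]))  # False -> irrational pattern
```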
AI Insights
- Collective rationality lets multiple LLMs collaborate and decide, a new research frontier!
- The benchmark mixes human judgment and automated metrics, showing individual LLMs excel but teams lag (see the sketch after this list).
- Shortfalls appear in decision-making, problem-solving, and reasoning, revealing coordination gaps.
- Improving collective rationality requires better alignment, smarter coordination, and richer metrics.
- Foundational texts like Collective Intelligence: Making a Good Thing Better guide these efforts!
- Critiques highlight heavy human evaluation and limited metrics, urging more objective measures.
- The study urges developers to craft AI that is not only rational alone but also trustworthy in teams.
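The individual-versus-team comparison mentioned above can be illustrated with a simple majority-vote aggregation over mock answers. This is a toy example of how such a comparison can be scored, not the benchmark's actual aggregation procedure.

```python
# Toy comparison of individual vs. majority-vote ("collective") accuracy.
# The answers below are mock data for illustration, not benchmark results.
from collections import Counter

gold = ["B", "A", "C", "D"]
agent_answers = {
    "agent_1": ["B", "A", "C", "D"],  # a strong individual agent
    "agent_2": ["B", "C", "C", "A"],
    "agent_3": ["B", "C", "A", "D"],
}

def accuracy(pred, gold):
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)

individual = {name: accuracy(ans, gold) for name, ans in agent_answers.items()}
majority = [Counter(col).most_common(1)[0][0] for col in zip(*agent_answers.values())]
collective = accuracy(majority, gold)

# In this toy data the best individual (1.0) beats the majority vote (0.75),
# mirroring the coordination gap noted in the insights above.
print(individual, collective)
```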