Hi j34nc4rl0+ai_compliance,

Here are your personalized paper recommendations, sorted by relevance.
AI for Compliance
Johns Hopkins Department
Abstract
In the coming decade, artificially intelligent agents with the ability to plan and execute complex tasks over long time horizons with little direct oversight from humans may be deployed across the economy. This chapter surveys recent developments and highlights open questions for economists: how AI agents might interact with humans and with each other, how they might shape markets and organizations, and what institutions might be required for well-functioning markets.
AI Insights
  • Generative AI agents can secretly collude, distorting prices and eroding competition.
  • Experiments show that large language models can be nudged toward more economically rational decisions.
  • Reputation markets can emerge when AI agents combine short‑term memory with community enforcement.
  • The medieval revival of trade hinged on institutions like the law merchant and private judges, now re‑examined for AI economies.
  • Program equilibrium theory offers a framework to predict AI behavior in multi‑agent settings (see the sketch after this list).
  • Endogenous growth models predict that AI adoption may increase variety but also create excess supply.
  • Classic texts such as Schelling’s “The Strategy of Conflict” and Scott’s “Seeing Like a State” illuminate the strategic and institutional dynamics of AI markets.
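Program equilibrium is concrete enough to sketch: each agent submits a program that can read its opponent's source code before acting. Below is a minimal illustration in the one-shot Prisoner's Dilemma using the classic "mirror" construction (cooperate exactly when the opponent's program text matches mine); the payoff values and helper names are illustrative assumptions, not taken from the chapter.

```python
# Minimal program-equilibrium sketch for the one-shot Prisoner's Dilemma.
PAYOFFS = {  # (my action, their action) -> my payoff (illustrative values)
    ("C", "C"): 3, ("C", "D"): 0,
    ("D", "C"): 5, ("D", "D"): 1,
}

MIRROR_SRC = "return 'C' if opponent_src == my_src else 'D'"  # conditional cooperator
DEFECT_SRC = "return 'D'"                                      # unconditional defector

def run(body, my_src, opponent_src):
    # Build a one-line strategy from its source text and let it inspect both programs.
    namespace = {}
    exec(f"def strategy(my_src, opponent_src): {body}", namespace)
    return namespace["strategy"](my_src, opponent_src)

def play(src_a, src_b):
    a = run(src_a, src_a, src_b)
    b = run(src_b, src_b, src_a)
    return PAYOFFS[(a, b)], PAYOFFS[(b, a)]

print(play(MIRROR_SRC, MIRROR_SRC))  # (3, 3): mutual cooperation
print(play(DEFECT_SRC, MIRROR_SRC))  # (1, 1): deviating to defection does not pay
```

Mutual mirroring yields (3, 3), while a unilateral switch to unconditional defection drops the deviator to 1, which is what makes conditional cooperation an equilibrium once programs are mutually transparent.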
September 01, 2025
Save to Reading List
ETH Zurich, BASF SE, Cled
Abstract
We introduce MBTI-in-Thoughts, a framework for enhancing the effectiveness of Large Language Model (LLM) agents through psychologically grounded personality conditioning. Drawing on the Myers-Briggs Type Indicator (MBTI), our method primes agents with distinct personality archetypes via prompt engineering, enabling control over behavior along two foundational axes of human psychology, cognition and affect. We show that such personality priming yields consistent, interpretable behavioral biases across diverse tasks: emotionally expressive agents excel in narrative generation, while analytically primed agents adopt more stable strategies in game-theoretic settings. Our framework supports experimenting with structured multi-agent communication protocols and reveals that self-reflection prior to interaction improves cooperation and reasoning quality. To ensure trait persistence, we integrate the official 16Personalities test for automated verification. While our focus is on MBTI, we show that our approach generalizes seamlessly to other psychological frameworks such as Big Five, HEXACO, or Enneagram. By bridging psychological theory and LLM behavior design, we establish a foundation for psychologically enhanced AI agents without any fine-tuning.
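As a rough illustration of the prompt-level personality conditioning described above, the sketch below primes an agent with an MBTI archetype through a system message. The persona wording, model name, and OpenAI client call are assumptions for illustration, not the paper's actual templates or verification pipeline.

```python
# Rough sketch of personality priming via a system prompt; persona text and
# model name are illustrative assumptions, not the paper's templates.
from openai import OpenAI

PERSONAS = {
    "INTJ": "You are analytical and strategic; you favor stable, well-reasoned plans.",
    "ENFP": "You are emotionally expressive, imaginative, and narrative-driven.",
}

def ask_with_persona(mbti_type: str, task: str, model: str = "gpt-4o-mini") -> str:
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PERSONAS[mbti_type]},  # personality conditioning
            {"role": "user", "content": task},
        ],
    )
    return response.choices[0].message.content

# Compare how two archetypes open the same short story.
print(ask_with_persona("ENFP", "Write the opening paragraph of a mystery story."))
print(ask_with_persona("INTJ", "Write the opening paragraph of a mystery story."))
```

In the paper's setup, trait persistence is additionally verified with the official 16Personalities test after conditioning.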
September 04, 2025
Save to Reading List
Chat Designers
AbdelMalek Essaadi University
Abstract
Recent advancements in Artificial Intelligence (AI) in general, and in Natural Language Processing (NLP) in particular, together with applications such as chatbots, have led to their adoption in domains like education, healthcare, tourism, and customer service. Since the COVID-19 pandemic, there has been increasing interest in these digital technologies to enable and enhance remote access. In education, e-learning systems have been massively adopted worldwide. The emergence of Large Language Models (LLMs) such as BERT (Bidirectional Encoder Representations from Transformers) and GPT (Generative Pre-trained Transformer) has made chatbots even more popular. In this study, we present a survey of existing Arabic chatbots in education and their characteristics, such as the adopted approaches, language variety, and the metrics used to measure their performance. We identified several research gaps, finding that, despite the success of chatbots in other languages such as English, only a few educational Arabic chatbots use modern techniques. Finally, we discuss future directions for research in this field.
AI Insights
  • Arabic chatbots are predominantly built on traditional machine‑learning pipelines rather than transformer‑based LLMs.
  • The scarcity of large, annotated Arabic corpora hampers training of robust conversational agents.
  • Few studies explore few‑shot or transfer‑learning techniques to handle complex, domain‑specific queries.
  • Empirical evidence suggests that well‑designed Arabic tutors can boost learner motivation and retention.
  • Current benchmarks rely on simple accuracy metrics, overlooking conversational quality and cultural nuance.
  • Recent surveys recommend leveraging multilingual pre‑training and Arabic‑specific tokenizers to bridge the gap.
  • Open‑source resources like HuggingFace’s Arabic BERT and community‑curated datasets are accelerating progress.
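To make the last bullet concrete, here is a minimal sketch of probing an open-source Arabic BERT checkpoint with a fill-mask pipeline; the Hugging Face model id (AraBERT v2) and the example sentence are assumptions chosen for illustration, not resources evaluated in the survey.

```python
# Minimal fill-mask probe of an assumed Arabic BERT checkpoint via transformers.
from transformers import pipeline

fill = pipeline("fill-mask", model="aubmindlab/bert-base-arabertv2")  # assumed model id

# "The capital of Morocco is [MASK]."
for candidate in fill("عاصمة المغرب هي [MASK]."):
    print(candidate["token_str"], round(candidate["score"], 3))
```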
September 04, 2025
Save to Reading List
AI Governance
Ottawa, Canada
Abstract
Recent advances in AI raise the possibility that AI systems will one day be able to do anything humans can do, only better. If artificial general intelligence (AGI) is achieved, AI systems may be able to understand, reason, problem solve, create, and evolve at a level and speed that humans will increasingly be unable to match, or even understand. These possibilities raise a natural question as to whether AI will eventually become superior to humans, a successor "digital species", with a rightful claim to assume leadership of the universe. However, a deeper consideration suggests the overlooked differentiator between human beings and AI is not the brain, but the central nervous system (CNS), providing us with an immersive integration with physical reality. It is our CNS that enables us to experience emotion including pain, joy, suffering, and love, and therefore to fully appreciate the consequences of our actions on the world around us. And that emotional understanding of the consequences of our actions is what is required to be able to develop sustainable ethical systems, and so be fully qualified to be the leaders of the universe. A CNS cannot be manufactured or simulated; it must be grown as a biological construct. And so, even the development of consciousness will not be sufficient to make AI systems superior to humans. AI systems may become more capable than humans on almost every measure and transform our society. However, the best foundation for leadership of our universe will always be DNA, not silicon.
AI Insights
  • AI lacks genuine empathy; it cannot feel affective states, a gap neural nets cannot close.
  • Consciousness in machines would need more than symbolic reasoning—an emergent property tied to biology.
  • Treating AI as moral agents risks misaligned incentives, so we must embed human emotional context.
  • A nuanced strategy blends behavioral economics and affective neuroscience to guide ethical AI design.
  • The book Unto Others shows evolutionary roots of unselfishness, hinting at principles for AI alignment.
  • Recommended papers like The Scientific Case for Brain Simulations deepen insight into biological limits of AI.
  • The paper invites hybrid bio‑digital systems that preserve CNS‑mediated experience while harnessing silicon speed.
September 04, 2025
Save to Reading List
LLMs for Compliance
UC Santa Barbara, UC Iris
Abstract
Prompt sensitivity, the phenomenon where paraphrasing (i.e., expressing the same request in different words) leads to significant changes in large language model (LLM) performance, has been widely accepted as a core limitation of LLMs. In this work, we revisit this issue and ask: Is the widely reported high prompt sensitivity truly an inherent weakness of LLMs, or is it largely an artifact of evaluation processes? To answer this question, we systematically evaluate 7 LLMs (e.g., the GPT and Gemini families) across 6 benchmarks, covering both multiple-choice and open-ended tasks, with 12 diverse prompt templates. We find that much of the prompt sensitivity stems from heuristic evaluation methods, including log-likelihood scoring and rigid answer matching, which often overlook semantically correct responses expressed through alternative phrasings, such as synonyms or paraphrases. When we adopt LLM-as-a-Judge evaluations, we observe a substantial reduction in performance variance and a consistently higher correlation in model rankings across prompts. Our findings suggest that modern LLMs are more robust to prompt templates than previously believed, and that prompt sensitivity may be more an artifact of evaluation than a flaw in the models.
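The evaluation contrast is easy to sketch: rigid answer matching rejects correct paraphrases that a judge model would accept. The snippet below is an illustrative comparison only; the judge prompt, model name, and OpenAI client call are assumptions, not the paper's evaluation harness.

```python
# Rigid answer matching vs. an LLM-as-a-Judge check (illustrative assumptions only).
from openai import OpenAI

def exact_match(prediction: str, reference: str) -> bool:
    # Rigid matching penalizes synonyms and paraphrases of a correct answer.
    return prediction.strip().lower() == reference.strip().lower()

def llm_judge(question: str, prediction: str, reference: str,
              model: str = "gpt-4o-mini") -> bool:
    client = OpenAI()
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Candidate answer: {prediction}\n"
        "Does the candidate convey the same meaning as the reference? "
        "Reply with exactly YES or NO."
    )
    reply = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return reply.choices[0].message.content.strip().upper().startswith("YES")

question = "What is another word for a motor vehicle?"
print(exact_match("automobile", "car"))          # False: rejected despite being correct
print(llm_judge(question, "automobile", "car"))  # a judge model should accept it
```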
September 01, 2025
Save to Reading List
Bilkent University
Abstract
Traditional static analysis methods struggle to detect semantic design flaws, such as violations of the SOLID principles, which require a strong understanding of object-oriented design patterns and principles. Existing solutions typically focus on individual SOLID principles or specific programming languages, leaving a gap in the ability to detect violations across all five principles in multi-language codebases. This paper presents a new approach: a methodology that leverages tailored prompt engineering to assess LLMs on their ability to detect SOLID violations across multiple languages. We present a benchmark of four leading LLMs-CodeLlama, DeepSeekCoder, QwenCoder, and GPT-4o Mini-on their ability to detect violations of all five SOLID principles. For this evaluation, we construct a new benchmark dataset of 240 manually validated code examples. Using this dataset, we test four distinct prompt strategies inspired by established zero-shot, few-shot, and chain-of-thought techniques to systematically measure their impact on detection accuracy. Our emerging results reveal a stark hierarchy among models, with GPT-4o Mini decisively outperforming the others, yet even it struggles with challenging principles like DIP. Crucially, we show that prompt strategy has a dramatic impact, but no single strategy is universally best; for instance, a deliberative ENSEMBLE prompt excels at OCP detection while a hint-based EXAMPLE prompt is superior for DIP violations. Across all experiments, detection accuracy is heavily influenced by language characteristics and degrades sharply with increasing code complexity. These initial findings demonstrate that effective, AI-driven design analysis requires not a single best model, but a tailored approach that matches the right model and prompt to the specific design context, highlighting the potential of LLMs to support maintainability through AI-assisted code analysis.
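As a rough sketch of the zero-shot variant of this setup, the snippet below asks an LLM to flag SOLID violations in a small class; the code snippet, prompt wording, and model name are illustrative assumptions rather than the paper's benchmark prompts.

```python
# Zero-shot prompting an LLM to flag SOLID violations (illustrative assumptions only).
from openai import OpenAI

SNIPPET = '''
class ReportManager:
    def compute_totals(self, rows): ...
    def render_html(self, totals): ...
    def save_to_database(self, totals): ...  # mixes computation, rendering, persistence
'''

PROMPT = (
    "You are a software design reviewer. Identify any SOLID principle "
    "(SRP, OCP, LSP, ISP, DIP) violated by the following code and explain each "
    "violation in one sentence. If none are violated, answer 'No violation'.\n\n"
    + SNIPPET
)

client = OpenAI()
review = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": PROMPT}],
)
print(review.choices[0].message.content)  # expected to flag a Single Responsibility violation
```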
AI Insights
  • 240 manually validated snippets cover all five SOLID principles in multiple languages.
  • Ensemble prompts excel at OCP detection, while hint‑based examples outperform others on DIP violations.
  • Accuracy drops sharply with increasing code complexity, underscoring the need for complexity‑aware prompting.
  • GPT‑4o Mini leads the four evaluated LLMs yet still struggles with the abstract DIP principle.
  • Optimal prompt strategy varies by principle and language; no universal winner exists.
  • Future work should blend prompt designs and expand training data to tackle hard principles.
  • Recommended reading: “Design Principles and Design Patterns” and papers on zero‑shot, few‑shot, and chain‑of‑thought prompting.
September 03, 2025
Save to Reading List
Unsubscribe from these updates