LLMs for Productivity

Cache What Lasts: Token Retention for Memory-Bounded KV Cache in LLMs

Yale University

Abstract
Memory and computation remain core bottlenecks in long-horizon LLM inference due to the quadratic cost of self-attention and the ever-growing key-value (KV) cache. Existing strategies for memory-bounded inference, such as quantization, offloading, or heuristic KV eviction, either incur high orchestration costs or rely on unreliable attention-based proxies of importance. We propose TRIM-KV, a novel approach that learns each token's intrinsic importance at creation time via a lightweight retention gate. Each gate predicts a scalar retention score that decays over time, reflecting the long-term utility of the token for a specific layer and head. Tokens with low scores are evicted when the memory budget is exceeded, ensuring that the cache always contains the most critical tokens. TRIM-KV is trained efficiently through distillation from a frozen LLM combined with a capacity loss, requiring only gate fine-tuning and adding negligible inference overhead. Across mathematical reasoning (GSM8K, MATH-500, AIME24), procedural generation (LongProc), conversational long-memory benchmarks (LongMemEval), and long-context understanding (LongBench and SCBench), TRIM-KV consistently outperforms strong eviction and learnable retrieval baselines, especially in low-memory regimes. Remarkably, it even surpasses full-cache models in some settings, showing that selective retention can serve as a form of regularization, suppressing noise from uninformative tokens. Qualitative analyses further reveal that learned retention scores align with human intuition, naturally recovering heuristics such as sink tokens, sliding windows, and gist compression without explicit design. Beyond efficiency, retention scores provide insights into layer- and head-specific roles, suggesting a new path toward LLM interpretability.
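The eviction rule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class names, the exponential decay form, and the single shared cache (rather than one per layer and head) are all assumptions made for illustration, and the scores are hand-picked stand-ins for what a learned retention gate would predict.

```python
from dataclasses import dataclass, field


@dataclass
class Entry:
    token: str
    score: float   # retention score assigned at creation (stand-in for a learned gate output)
    created: int   # step at which the token entered the cache


@dataclass
class BoundedCache:
    budget: int            # maximum number of tokens kept
    decay: float = 0.9     # assumed exponential decay rate per step
    step: int = 0
    entries: list = field(default_factory=list)

    def current_score(self, e: Entry) -> float:
        # Score decays with the token's age (assumed form: score * decay ** age).
        return e.score * self.decay ** (self.step - e.created)

    def add(self, token: str, score: float) -> None:
        self.entries.append(Entry(token, score, self.step))
        self.step += 1
        # Evict the lowest-scoring tokens until the cache fits the budget.
        while len(self.entries) > self.budget:
            victim = min(self.entries, key=self.current_score)
            self.entries.remove(victim)

    def tokens(self) -> list:
        return [e.token for e in self.entries]


cache = BoundedCache(budget=3)
for tok, s in [("<s>", 1.0), ("the", 0.1), ("42", 0.9), ("um", 0.05), ("answer", 0.8)]:
    cache.add(tok, s)
print(cache.tokens())
```

In this toy run the low-scoring filler tokens are dropped first, while the high-scoring first token survives despite its age, loosely mirroring the sink-token behavior the abstract mentions.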

AI Summary

Attention Mechanism: A neural network component that allows the model to focus on specific parts of the input when generating output. [3]
The authors also mention the work of 'Hermann Ebbinghaus' on memory and its contribution to experimental psychology. [3]
The paper addresses the memory bottleneck in large language models (LLMs) by proposing TRIM-KV, which uses a lightweight learned retention gate to decide which tokens stay in a memory-bounded KV cache. [2]
The authors do not provide a detailed analysis of the computational complexity of their method. [1]

Benchmarking LLM Agents for Wealth-Management Workflows

University of Edinburgh

Abstract
Modern work relies on an assortment of digital collaboration tools, yet routine processes continue to suffer from human error and delay. To address this gap, this dissertation extends TheAgentCompany with a finance-focused environment and investigates whether a general-purpose LLM agent can complete representative wealth-management tasks both accurately and economically. This study introduces synthetic domain data, enriches colleague simulations, and prototypes an automatic task-generation pipeline. The study aims to create and assess an evaluation set that can meaningfully measure an agent's fitness for assistant-level wealth-management work. We construct a benchmark of 12 task-pairs for wealth-management assistants spanning retrieval, analysis, and synthesis/communication, with explicit acceptance criteria and deterministic graders. We seeded a set of new finance-specific data and introduced a high-autonomy and a low-autonomy variant of every task. The study concludes that agents are limited less by mathematical reasoning than by end-to-end workflow reliability, that they are meaningfully affected by autonomy level, and that incorrect evaluation of models has hindered benchmarking.

AI for Productivity Tools

InnoGym: Benchmarking the Innovation Potential of AI Agents

Zhejiang University

Abstract
LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.

AI Summary

The proposed benchmark may not capture all aspects of human intelligence, such as common sense or creativity. [3]
These models are like super-smart computers that can understand and generate human-like text. [3]
The authors want to make sure these models can solve math problems, which is an important part of being intelligent. [3]
The paper discusses the challenges of evaluating large language models (LLMs) and proposes a new benchmark for measuring their performance. [2]

Eval Factsheets: A Structured Framework for Documenting AI Evaluations

Meta

Abstract
The rapid proliferation of benchmarks has created significant challenges in reproducibility, transparency, and informed decision-making. However, unlike datasets and models -- which benefit from structured documentation frameworks like Datasheets and Model Cards -- evaluation methodologies lack systematic documentation standards. We introduce Eval Factsheets, a structured, descriptive framework for documenting AI system evaluations through a comprehensive taxonomy and questionnaire-based approach. Our framework organizes evaluation characteristics across five fundamental dimensions: Context (Who made the evaluation and when?), Scope (What does it evaluate?), Structure (With what is the evaluation built?), Method (How does it work?) and Alignment (In what ways is it reliable/valid/robust?). We implement this taxonomy as a practical questionnaire spanning five sections with mandatory and recommended documentation elements. Through case studies on multiple benchmarks, we demonstrate that Eval Factsheets effectively capture diverse evaluation paradigms -- from traditional benchmarks to LLM-as-judge methodologies -- while maintaining consistency and comparability. We hope Eval Factsheets are incorporated into both existing and newly released evaluation frameworks and lead to more transparency and reproducibility.

AI Summary

Factsheets are designed to provide a comprehensive overview of the model's capabilities, limitations, and potential biases. [3]
The authors argue that Factsheets can improve transparency, reproducibility, and informed decision-making in AI research. [3]
Model Description: A detailed description of the model's architecture, training data, and hyperparameters. [3]
Data Statement: A statement that describes the data used to train the model, including its source, quality, and any potential biases. [3]
Evaluation Metrics: The metrics used to evaluate the model's performance, such as accuracy, precision, and recall. [3]
Use Cases: Examples of how the model can be used in real-world applications. [3]
The paper proposes a framework for evaluation of large language models (LLMs) called Factsheets. [2]

Economics of Productivity

Measuring Agents in Production

UC Berkeley

Abstract
AI agents are actively running in production across diverse industries, yet little is publicly known about which technical approaches enable successful real-world deployments. We present the first large-scale systematic study of AI agents in production, surveying 306 practitioners and conducting 20 in-depth case studies via interviews across 26 domains. We investigate why organizations build agents, how they build them, how they evaluate them, and what the top development challenges are. We find that production agents are typically built using simple, controllable approaches: 68% execute at most 10 steps before requiring human intervention, 70% rely on prompting off-the-shelf models instead of weight tuning, and 74% depend primarily on human evaluation. Reliability remains the top development challenge, driven by difficulties in ensuring and evaluating agent correctness. Despite these challenges, simple yet effective methods already enable agents to deliver impact across diverse industries. Our study documents the current state of practice and bridges the gap between research and deployment by providing researchers visibility into production challenges while offering practitioners proven patterns from successful deployments.

AI Summary

Agentic AI: A type of artificial intelligence that enables systems to perform tasks autonomously and adapt to changing situations. [3]
Agent Architecture: The design and structure of an Agentic AI system, including its components and how they interact with each other. [3]
Humans dominate prompt construction in Agentic AI systems. [2]

A new family of models with generalized orientation in data envelopment analysis

Universidad de Valencia

Abstract
In the framework of data envelopment analysis, we review directional models \citep{Chambers1996, Chambers1998, Briec1997} and show that they are inadequate when inputs and outputs are improved simultaneously under constant returns to scale. Conversely, we introduce a new family of quadratically constrained models with generalized orientation and demonstrate that these models overcome this limitation. Furthermore, we extend the Farrell measure of technical efficiency using these new models. Additionally, we prove that the family of generalized oriented models satisfies some desired monotonicity properties. Finally, we show that the new models, although being quadratically constrained, can be solved through linear programs in a fundamental particular case.
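For readers unfamiliar with directional models, the standard textbook form of the directional distance function and its DEA envelopment program under constant returns to scale can be sketched as follows. The notation is assumed here for illustration (technology $T$, direction vector $g = (g_x, g_y) \ge 0$, evaluated unit $o$, intensity weights $\lambda_j$) and may differ from the paper's; the new quadratically constrained family is not specified in the abstract and is not reproduced.

```latex
% Directional distance function on technology T with direction g = (g_x, g_y) >= 0
\vec{D}_T(x, y; g_x, g_y) = \sup\{\beta \in \mathbb{R} : (x - \beta g_x,\; y + \beta g_y) \in T\}

% DEA envelopment form under constant returns to scale for unit o among units j = 1..n
\begin{aligned}
\max_{\beta,\,\lambda}\quad & \beta \\
\text{s.t.}\quad & \sum_{j=1}^{n} \lambda_j x_{ij} \le x_{io} - \beta g_{x,i}, && i = 1,\dots,m, \\
& \sum_{j=1}^{n} \lambda_j y_{rj} \ge y_{ro} + \beta g_{y,r}, && r = 1,\dots,s, \\
& \lambda_j \ge 0, && j = 1,\dots,n.
\end{aligned}
```

Here inputs are contracted and outputs expanded along a fixed direction by the same step $\beta$, which is the simultaneous-improvement setting the abstract refers to.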

AI Summary

The paper discusses two types of models used in economics: Linear Directional (LO) models and Quadratic Orientation (QO) models. [3]
LO models are useful when improving inputs or outputs separately, but not suitable for simultaneous improvement under the CRS assumption. [3]
The paper discusses linear directional (LO) models and introduces Quadratic Orientation (QO) models, which are applicable in specific circumstances. [2]