LikeBench: Evaluating Subjective Likability in LLMs for Personalization

UC Santa Barbara

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

The study examines the performance of various language models on a dataset with diverse user profiles, highlighting the importance of personality traits and conversation style in determining likability. [3]
The results show that top-performing models (GPT-5 and Claude Sonnet 4) are stable across different user types, while others struggle to adapt. [3]
The introduction of Dynamic User Profile (DUP) improves performance for top models without additional training, indicating the potential of lightweight preference tracking. [3]
Likability: The degree to which a language model is perceived as likable or relatable by users. [3]
Conversation style preferences: The way individuals prefer to communicate, including aspects such as directness, formality, and conversation length. [3]
Lightweight preference tracking (DUP) shows promise in improving performance without additional training. [3]
The study's taxonomy of personality traits and conversation style preferences provides a comprehensive framework for understanding user behavior. [2]

Abstract
A personalized LLM should remember user facts, apply them correctly, and adapt over time to provide responses that the user prefers. Existing LLM personalization benchmarks are largely centered on two axes: accurately recalling user information and accurately applying remembered information in downstream tasks. We argue that a third axis, likability, is both subjective and central to user experience, yet under-measured by current benchmarks. To measure likability holistically, we introduce LikeBench, a multi-session, dynamic evaluation framework that measures likability across multiple dimensions by how much an LLM can adapt over time to a user's preferences to provide more likable responses. In LikeBench, the LLMs engage in conversation with a simulated user and learn preferences only from the ongoing dialogue. As the interaction unfolds, models try to adapt to responses, and after each turn, they are evaluated for likability across seven dimensions by the same simulated user. To the best of our knowledge, we are the first to decompose likability into multiple diagnostic metrics: emotional adaptation, formality matching, knowledge adaptation, reference understanding, conversation length fit, humor fit, and callback, which makes it easier to pinpoint where a model falls short. To make the simulated user more realistic and discriminative, LikeBench uses fine-grained, psychologically grounded descriptive personas rather than the coarse high/low trait rating based personas used in prior work. Our benchmark shows that strong memory performance does not guarantee high likability: DeepSeek R1, with lower memory accuracy (86%, 17 facts/profile), outperformed Qwen3 by 28% on likability score despite Qwen3's higher memory accuracy (93%, 43 facts/profile). Even SOTA models like GPT-5 adapt well in short exchanges but show only limited robustness in longer, noisier interactions.

Why we are recommending this paper?
Due to your Interest in: Personalization Platform

This paper directly addresses personalization, a core interest, by evaluating how LLMs can learn and adapt to user preferences. Given your focus on personalization platforms and CRM optimization, understanding how to measure subjective likability is crucial.

Towards Proactive Personalization through Profile Customization for Individual Users in Dialogues

Zhejiang University

Rate paper: 👍 👎 ♥ Save

AI Insights

The proposed approach demonstrates the importance of rich contextual and behavioral signals for personalization, highlighting the need for more nuanced understanding of user characteristics and behaviors. [3]
Previous work in user profile and recommender systems emphasizes the importance of rich contextual and behavioral signals for personalization. [3]
Personalization in Conversational AI The paper presents a new approach to making conversational AI more personalized. [3]
It focuses on understanding users' preferences and adapting to their needs, rather than relying solely on static information. [3]
The paper presents a comprehensive approach to personalization in conversational AI, focusing on aligning large language models (LLMs) with user preferences. [2]

Abstract
The deployment of Large Language Models (LLMs) in interactive systems necessitates a deep alignment with the nuanced and dynamic preferences of individual users. Current alignment techniques predominantly address universal human values or static, single-turn preferences, thereby failing to address the critical needs of long-term personalization and the initial user cold-start problem. To bridge this gap, we propose PersonalAgent, a novel user-centric lifelong agent designed to continuously infer and adapt to user preferences. PersonalAgent constructs and dynamically refines a unified user profile by decomposing dialogues into single-turn interactions, framing preference inference as a sequential decision-making task. Experiments show that PersonalAgent achieves superior performance over strong prompt-based and policy optimization baselines, not only in idealized but also in noisy conversational contexts, while preserving cross-session preference consistency. Furthermore, human evaluation confirms that PersonalAgent excels at capturing user preferences naturally and coherently. Our findings underscore the importance of lifelong personalization for developing more inclusive and adaptive conversational agents. Our code is available here.

Why we are recommending this paper?
Due to your Interest in: Personalization Platform

This work explores proactive personalization strategies within conversational systems, aligning with your interest in personalization and data-driven CRM approaches. The focus on individual user preferences is particularly relevant to your domain.

DataFlow: An LLM-Driven Framework for Unified Data Preparation and Workflow Automation in the Era of Data-Centric AI

Peking University

Rate paper: 👍 👎 ♥ Save

AI Insights

It organizes operators along multiple orthogonal categorization dimensions, including modality, core vs. [3]
PyTorch: An open-source machine learning library for Python. [3]
domain-specific, and functional categories. [2]
DataFlow is a unified data preparation framework that supports end-to-end LLM data preparation workflows. [1]

Abstract
The rapidly growing demand for high-quality data in Large Language Models (LLMs) has intensified the need for scalable, reliable, and semantically rich data preparation pipelines. However, current practices remain dominated by ad-hoc scripts and loosely specified workflows, which lack principled abstractions, hinder reproducibility, and offer limited support for model-in-the-loop data generation. To address these challenges, we present DataFlow, a unified and extensible LLM-driven data preparation framework. DataFlow is designed with system-level abstractions that enable modular, reusable, and composable data transformations, and provides a PyTorch-style pipeline construction API for building debuggable and optimizable dataflows. The framework consists of nearly 200 reusable operators and six domain-general pipelines spanning text, mathematical reasoning, code, Text-to-SQL, agentic RAG, and large-scale knowledge extraction. To further improve usability, we introduce DataFlow-Agent, which automatically translates natural-language specifications into executable pipelines via operator synthesis, pipeline planning, and iterative verification. Across six representative use cases, DataFlow consistently improves downstream LLM performance. Our math, code, and text pipelines outperform curated human datasets and specialized synthetic baselines, achieving up to +3\% execution accuracy in Text-to-SQL over SynSQL, +7\% average improvements on code benchmarks, and 1--3 point gains on MATH, GSM8K, and AIME. Moreover, a unified 10K-sample dataset produced by DataFlow enables base models to surpass counterparts trained on 1M Infinity-Instruct data. These results demonstrate that DataFlow provides a practical and high-performance substrate for reliable, reproducible, and scalable LLM data preparation, and establishes a system-level foundation for future data-centric AI development.

Why we are recommending this paper?
Due to your Interest in: Data Driven CRM

Given your interest in data-driven CRM optimization, this paper’s focus on scalable data preparation pipelines for LLMs is highly relevant. The framework’s automation capabilities could significantly improve your data workflows.

Beyond Text-to-SQL: Autonomous Research-Driven Database Exploration with DAR

Mantis Analytics

Rate paper: 👍 👎 ♥ Save

Abstract
Large language models can already query databases, yet most existing systems remain reactive: they rely on explicit user prompts and do not actively explore data. We introduce DAR (Data Agnostic Researcher), a multi-agent system that performs end-to-end database research without human-initiated queries. DAR orchestrates specialized AI agents across three layers: initialization (intent inference and metadata extraction), execution (SQL and AI-based query synthesis with iterative validation), and synthesis (report generation with built-in quality control). All reasoning is executed directly inside BigQuery using native generative AI functions, eliminating data movement and preserving data governance. On a realistic asset-incident dataset, DAR completes the full analytical task in 16 minutes, compared to 8.5 hours for a professional analyst (approximately 32x times faster), while producing useful pattern-based insights and evidence-grounded recommendations. Although human experts continue to offer deeper contextual interpretation, DAR excels at rapid exploratory analysis. Overall, this work shifts database interaction from query-driven assistance toward autonomous, research-driven exploration within cloud data warehouses.

Why we are recommending this paper?
Due to your Interest in: Data Driven CRM

This paper’s exploration of autonomous database research using LLMs aligns with the need for intelligent data access within a CRM context. The ability to proactively explore data without explicit prompts is a valuable capability.

Ladder Up, Memory Down: Low-Cost Fine-Tuning With Side Nets

Universit de Lorraine

Rate paper: 👍 👎 ♥ Save

Rate image: 👍 👎

AI Insights

Some studies have shown that parameter-efficient fine-tuning can be more effective than in-context learning for certain tasks. [3]
Fine-tuning: The process of adjusting a pre-trained model to fit a specific task or dataset. [3]
In-context learning: A method of training models by providing them with the context in which they will be used, rather than relying on explicit fine-tuning. [3]
The field of large language models (LLMs) is rapidly evolving with new research papers being published regularly. [2]
There is a growing interest in developing open-source instruction data for math-related tasks to accelerate AI research in this area. [1]

Abstract
Fine-tuning large language models (LLMs) is often limited by the memory available on commodity GPUs. Parameter-efficient fine-tuning (PEFT) methods such as QLoRA reduce the number of trainable parameters, yet still incur high memory usage induced by the backward pass in the full model. We revisit Ladder Side Tuning (LST), a rarely explored PEFT technique that adds a lightweight side network, and show that it matches QLoRA's compute scaling slope while cutting peak memory by 50\%. Across different downstream benchmarks spanning natural language understanding, mathematical and LLM-critic tasks, LST has competitive performance with QLoRA's accuracy on average while being much more memory-efficient. This efficiency enables fine-tuning of 7B-parameter models on a single 12 GB consumer GPU with 2k-token contexts, requiring no gradient checkpointing\textemdash conditions under which QLoRA exhausts memory. Beyond memory efficiency, we also establish scaling laws showing that LST scales similarly to QLoRA. We exploit Ladder's architectural flexibility by introducing xLadder, a depth-extended variant that increases effective depth via cross-connections and shortens chain-of-thought (CoT) at fixed parameter count. Ladder is strong when memory is the bottleneck; xLadder builds on this by enabling deeper reasoning without additional memory overhead.

Why we are recommending this paper?
Due to your Interest in: CRM Optimization

This research addresses the practical challenges of fine-tuning LLMs, a key component of MLOps and personalization. The focus on reducing memory usage is directly applicable to optimizing LLM deployments for your applications.

Interests not found

Help us improve your experience!