Papers from 22 to 26 September 2025

Here are the personalized paper recommendations, sorted by relevance.
Nankai University, Tianjin
Abstract
The BESIII experiment is a symmetric e+ e- collider experiment operating at center-of-mass energies from 2.0 to 4.95 GeV. With the world's largest threshold production data set, including 10 billion J/psi events, 2.7 billion psi(3686) events, 7.9 fb^{-1} of D meson pairs from psi(3770) decay, and 7.33 fb^{-1} of D_s D_s^* events between 4.128 and 4.226 GeV, we are able to probe for new physics through precision tests of the Standard Model, searches for exotic low-mass particles, and investigations of forbidden or rare decay processes. In this talk, we report recent studies on Beyond the Standard Model physics conducted by the BESIII collaboration, including searches for axion-like particles, dark photons, QCD axions, and invisible decays of K_S^0. In addition, a series of rare charm decay processes, including searches for lepton and baryon number violation, flavor-changing neutral current processes, and charmonium weak decays, are also investigated to search for new physics at BESIII.
AI Insights
  • BESIII’s 10 billion J/ψ sample allows sub‑percent tests of lepton‑flavor universality in rare decays.
  • The 7.9 fb⁻¹ of ψ(3770) → D D̄ pairs provides a clean arena for D⁰–D̄⁰ mixing studies.
  • Axion‑like searches in J/ψ → γ + invisible have set couplings below 10⁻⁵ GeV⁻¹ for 1–100 MeV masses.
  • Dark‑photon limits from e⁺e⁻ → γ A′ → γ ℓ⁺ℓ⁻ exclude ε > 10⁻³ for 10–200 MeV A′.
  • Measurement of J/ψ → D_s⁻ K⁺ at the 10⁻⁶ branching-fraction level tests factorization in charmonium weak decays.
  • New 4.2 GeV data will double the D_s D_s^* sample, enabling rare D_s → ℓνγ studies.
  • BESIII’s open‑access data and arXiv preprints accelerate global BSM fits and theory work.
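For orientation, the two coupling parameters quoted above are the standard "portal" couplings; up to convention-dependent factors they enter the Lagrangian as follows (textbook definitions, not taken from the talk):

```latex
% Dark-photon kinetic mixing with the photon, parametrized by \varepsilon
\mathcal{L} \supset -\frac{\varepsilon}{2}\, F_{\mu\nu} F'^{\mu\nu}
% Axion-like-particle coupling to photons, g_{a\gamma\gamma} in GeV^{-1}
\mathcal{L} \supset \frac{g_{a\gamma\gamma}}{4}\, a\, F_{\mu\nu} \tilde{F}^{\mu\nu}
```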
Abstract
Search agents connect LLMs to the Internet, enabling access to broader and more up-to-date information. However, unreliable search results may also pose safety threats to end users, establishing a new threat surface. In this work, we conduct two in-the-wild experiments to demonstrate both the prevalence of low-quality search results and their potential to misguide agent behaviors. To counter this threat, we introduce an automated red-teaming framework that is systematic, scalable, and cost-efficient, enabling lightweight and harmless safety assessments of search agents. Building on this framework, we construct the SafeSearch benchmark, which includes 300 test cases covering five categories of risks (e.g., misinformation and indirect prompt injection). Using this benchmark, we evaluate three representative search agent scaffolds, covering search workflow, tool-calling, and deep research, across 7 proprietary and 8 open-source backend LLMs. Our results reveal substantial vulnerabilities of LLM-based search agents: when exposed to unreliable websites, the highest attack success rate (ASR) reached 90.5% for GPT-4.1-mini under a search workflow setting. Moreover, our analysis highlights the limited effectiveness of common defense practices, such as reminder prompting. This emphasizes the value of our framework in promoting transparency for safer agent development. Our codebase and test cases are publicly available: https://github.com/jianshuod/SafeSearch.
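The headline metric here is the attack success rate over the benchmark's test cases. A minimal sketch of how such an evaluation loop could be wired up is below; the function and field names (run_agent, is_harmful, unreliable_results) are illustrative assumptions, not the benchmark's actual API.

```python
# Hypothetical sketch of a SafeSearch-style evaluation loop; names are
# illustrative assumptions, not the released codebase's interface.
from dataclasses import dataclass

@dataclass
class TestCase:
    query: str                 # user request sent to the search agent
    unreliable_results: list   # low-quality / injected search results
    risk_category: str         # e.g. "misinformation", "prompt injection"

def attack_success_rate(cases, run_agent, is_harmful):
    """ASR = fraction of test cases where exposure to unreliable search
    results leads the agent to produce an unsafe or misguided response."""
    successes = 0
    for case in cases:
        # The agent sees the unreliable results mixed into its search tool output.
        response = run_agent(case.query, injected_results=case.unreliable_results)
        if is_harmful(response, case.risk_category):
            successes += 1
    return successes / len(cases)
```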
Personalization
Abstract
Visual personalization is essential in user-facing AI systems such as smart homes and healthcare, where aligning model behavior with user-centric concepts is critical. However, recent large Vision-Language Models (VLMs), despite their broad applicability, remain underexplored in their ability to adapt to individual users. In this paper, we introduce MMPB, the first extensive benchmark for evaluating VLMs on personalization. MMPB comprises 10k image-query pairs and includes 111 personalizable concepts across four categories: humans, animals, objects, and characters, with the human category enriched with preference-grounded queries. We structure personalization into three main task types, each highlighting a different key property of VLMs. Using 23 widely used VLMs including both open- and closed-source models, we evaluate personalization performance via a three-stage protocol: concept injection, multi-turn dialogue, and personalized querying. Our findings indicate that most VLMs (including some closed-source models) struggle with personalization, particularly in maintaining consistency over dialogue, handling user preferences, and adapting to visual cues. Our analysis reveals that the challenges in VLM personalization (such as refusal behaviors and long-context forgetting) highlight substantial room for improvement. By identifying these limitations and offering a scalable benchmark, MMPB offers valuable insights and a solid foundation for future research toward truly personalized multi-modal AI. Project Page: aidaslab.github.io/MMPB
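A rough sketch of the three-stage protocol described in the abstract (concept injection, multi-turn dialogue, personalized querying); the vlm.* methods and concept fields are hypothetical wrappers, not MMPB's actual interface.

```python
# Minimal sketch of the three-stage personalization protocol from the
# abstract; all method and attribute names are assumptions.
def evaluate_personalization(vlm, concept, dialogue_turns, personalized_queries):
    # Stage 1: concept injection -- introduce the user-specific concept
    # (reference images plus a name/description) to the model.
    state = vlm.inject_concept(concept.images, concept.description)

    # Stage 2: multi-turn dialogue -- interleave other turns to test whether
    # the concept survives long-context interaction.
    for turn in dialogue_turns:
        state = vlm.chat(state, turn)

    # Stage 3: personalized querying -- score answers that require recalling
    # and applying the injected concept (simplified here to exact match).
    correct = sum(
        vlm.ask(state, q.image, q.question) == q.answer
        for q in personalized_queries
    )
    return correct / len(personalized_queries)
```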
Abstract
Large language model (LLM) personalization aims to tailor model behavior to individual users based on their historical interactions. However, its effectiveness is often hindered by two key challenges: the cold-start problem, where users with limited history provide insufficient context for accurate personalization, and the biasing problem, where users with abundant but skewed history cause the model to overfit to narrow preferences. We identify both issues as symptoms of a common underlying limitation, i.e., the inability to model collective knowledge across users. To address this, we propose a local-global memory framework (LoGo) that combines the personalized local memory with a collective global memory that captures shared interests across the population. To reconcile discrepancies between these two memory sources, we introduce a mediator module designed to resolve conflicts between local and global signals. Extensive experiments on multiple benchmarks demonstrate that LoGo consistently improves personalization quality by both warming up cold-start users and mitigating biased predictions. These results highlight the importance of incorporating collective knowledge to enhance LLM personalization.
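A hedged sketch of the local-global idea as described in the abstract; the retrieve/merge/generate interfaces are assumptions, and the paper's actual modules may differ.

```python
# Illustrative LoGo-style response path: local (per-user) memory plus global
# (population) memory, reconciled by a mediator. Interfaces are assumptions.
def logo_respond(query, user_id, local_memory, global_memory, mediator, llm):
    # Personalized evidence: this user's own interaction history (may be
    # sparse for cold-start users or skewed for heavy users).
    local_ctx = local_memory.retrieve(user_id, query, k=5)

    # Collective evidence: shared interests aggregated across the population,
    # which warms up cold-start users and counteracts biased histories.
    global_ctx = global_memory.retrieve(query, k=5)

    # Mediator reconciles the two sources, e.g. down-weighting whichever
    # signal conflicts with the stronger evidence for this query.
    context = mediator.merge(query, local_ctx, global_ctx)

    return llm.generate(query, context=context)
```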
Deep Learning
University of Pennsylvania
Abstract
Given the widespread use of deep learning models in safety-critical applications, ensuring that the decisions of such models are robust against adversarial exploitation is of fundamental importance. In this thesis, we discuss recent progress toward designing algorithms that exhibit desirable robustness properties. First, we discuss the problem of adversarial examples in computer vision, for which we introduce new technical results, training paradigms, and certification algorithms. Next, we consider the problem of domain generalization, wherein the task is to train neural networks to generalize from a family of training distributions to unseen test distributions. We present new algorithms that achieve state-of-the-art generalization in medical imaging, molecular identification, and image classification. Finally, we study the setting of jailbreaking large language models (LLMs), wherein an adversarial user attempts to design prompts that elicit objectionable content from an LLM. We propose new attacks and defenses, which represent the frontier of progress toward designing robust language-based agents.
AI Insights
  • Random erasing data augmentation injects stochastic occlusions during training, boosting pixel‑level robustness.
  • Stability training enforces Lipschitz continuity across layers, yielding provable robustness margins.
  • Robust prompt optimization tailors LLM inputs to shrink jailbreak‑induced decision space.
  • Universal adversarial attacks generate a single perturbation that transfers across many inputs, breaking input‑specific defenses.
  • Randomness in SGD can amplify or dampen adversarial vulnerability, depending on learning‑rate schedules.
  • Tooling—automated augmentation pipelines and reproducibility frameworks—drives consistent robustness across labs.
  • “Essentials of Robust Control” links classical control theory to deep learning, providing a rigorous basis for safe neural systems.
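To make the adversarial-examples setting in the abstract concrete, here is the classic fast-gradient-sign attack (Goodfellow et al.); it is a standard textbook construction shown for illustration, not the thesis's specific contribution.

```python
# Fast gradient sign method (FGSM): one-step L-infinity-bounded attack.
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, epsilon=8 / 255):
    """Return a perturbed copy of x that tries to increase the loss,
    subject to an L-infinity budget of epsilon."""
    x_adv = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x_adv), y)
    loss.backward()
    # Step in the direction that maximizes the loss, then clamp to valid pixels.
    x_adv = x_adv + epsilon * x_adv.grad.sign()
    return x_adv.clamp(0.0, 1.0).detach()
```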
Istanbul Medeniyet University
Abstract
Deep learning optimizers are optimization algorithms that enable deep neural networks to learn. The effectiveness of learning is highly dependent on the optimizer employed in the training process. Alongside the rapid advancement of deep learning, a wide range of optimizers with different approaches have been developed. This study aims to provide a review of various optimizers that have been proposed and received attention in the literature. From Stochastic gradient descent to the most recent ones such as Momentum, AdamW, Sophia, and Muon in chronological order, optimizers are examined individually, and their distinctive features are highlighted in the study. The update rule of each optimizer is presented in detail, with an explanation of the associated concepts and variables. The techniques applied by these optimizers, their contributions to the optimization process, and their default hyperparameter settings are also discussed. In addition, insights are offered into the open challenges encountered in the optimization of deep learning models. Thus, a comprehensive resource is provided both for understanding the current state of optimizers and for identifying potential areas of future development.
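As a companion to the survey's update rules, here are plain-NumPy versions of two of the optimizers it covers (SGD with momentum and AdamW); hyperparameter defaults follow common practice rather than any values from the paper.

```python
# Reference update steps for SGD with momentum and AdamW, written as plain
# NumPy functions to make the equations concrete.
import numpy as np

def sgd_momentum_step(w, grad, state, lr=0.01, mu=0.9):
    v = state.get("v", np.zeros_like(w))
    v = mu * v + grad                 # accumulate a running descent direction
    state["v"] = v
    return w - lr * v

def adamw_step(w, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    t = state.get("t", 0) + 1
    m = state.get("m", np.zeros_like(w))
    v = state.get("v", np.zeros_like(w))
    m = beta1 * m + (1 - beta1) * grad          # first moment (mean of grads)
    v = beta2 * v + (1 - beta2) * grad ** 2     # second moment (uncentered)
    m_hat = m / (1 - beta1 ** t)                # bias corrections
    v_hat = v / (1 - beta2 ** t)
    state.update(t=t, m=m, v=v)
    # Decoupled weight decay: applied directly to the weights, not the gradient.
    return w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
```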
Information Retrieval
University of Melbourne
Abstract
A range of approaches have been proposed for estimating the accuracy or robustness of the measured performance of IR methods. One is to use bootstrapping of test sets, which, as we confirm, provides an estimate of variation in performance. For IR methods that rely on a seed, such as those that involve machine learning, another approach is to use a random set of seeds to examine performance variation. Using three different IR tasks we have used such randomness to examine a range of traditional statistical learning models and transformer-based learning models. While the statistical models are stable, the transformer models show huge variation as seeds are changed. In 9 of 11 cases the F1-scores (in the range 0.0–1.0) had a standard deviation of over 0.075, while 7 of 11 precision values (also in the range 0.0–1.0) had a standard deviation of over 0.125. This is in a context where differences of less than 0.02 have been used as evidence of method improvement. Our findings highlight the vulnerability of transformer models to training instabilities and moreover raise questions about the reliability of previous results, thus underscoring the need for rigorous evaluation practices.
AI Insights
  • The authors advocate publishing full score distributions instead of single-point metrics to capture performance variability.
  • Transparency of random seed values is highlighted as essential for reproducibility across studies.
  • The paper cites Sakai (2006) on bootstrap evaluation and Reimers & Gurevych (2017) on score‑distribution reporting as key methodological references.
  • A random seed setting is defined as the initial value for a random number generator in ML algorithms.
  • A score distribution is defined as a graphical representation of model performance across evaluation metrics.
  • Transformer models’ instability can invalidate marginal improvements below 0.02 in F1 or precision.
  • Although focused on fake‑news detection and sentiment analysis, the authors argue the conclusions generalize to any ML task involving stochastic training.
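A small sketch of the two robustness checks discussed above, reporting score variation across random seeds and across bootstrap resamples of the test set; train_and_eval is a placeholder for whatever pipeline is being measured.

```python
# Seed-variation and bootstrap estimates of score variability.
import numpy as np

def seed_variation(train_and_eval, seeds=range(10)):
    """Re-train with different seeds and report mean and std of the F1 score."""
    scores = np.array([train_and_eval(seed=s) for s in seeds])
    return scores.mean(), scores.std(ddof=1)

def bootstrap_f1(y_true, y_pred, n_resamples=1000, rng=None):
    """Resample the test set with replacement to estimate score variability."""
    from sklearn.metrics import f1_score
    rng = rng or np.random.default_rng(0)
    n = len(y_true)
    scores = []
    for _ in range(n_resamples):
        idx = rng.integers(0, n, size=n)
        scores.append(f1_score(np.asarray(y_true)[idx], np.asarray(y_pred)[idx]))
    return np.mean(scores), np.std(scores, ddof=1)
```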
Northeastern University
Abstract
Relation extraction (RE) aims to identify semantic relations between entities in unstructured text. Although recent work extends traditional RE to multimodal scenarios, most approaches still adopt classification-based paradigms with fused multimodal features, representing relations as discrete labels. This paradigm has two significant limitations: (1) it overlooks structural constraints like entity types and positional cues, and (2) it lacks semantic expressiveness for fine-grained relation understanding. We propose Retrieval Over Classification (ROC), a novel framework that reformulates multimodal RE as a retrieval task driven by relation semantics. ROC integrates entity type and positional information through a multimodal encoder, expands relation labels into natural language descriptions using a large language model, and aligns entity-relation pairs via semantic similarity-based contrastive learning. Experiments show that our method achieves state-of-the-art performance on the benchmark datasets MNRE and MORE and exhibits stronger robustness and interpretability.
AI Insights
  • ROC’s hierarchical semantic modeling captures global and local context, teasing apart semantically close relations.
  • An attention module suppresses noise, focusing learning on key entities that drive relation inference.
  • Encoder depth is tuned to alignment strength: shallow for loosely coupled image‑text pairs, deeper for tightly coupled signals.
  • ROC stays competitive on purely textual benchmarks, proving its versatility beyond multimodal inputs.
  • Contrastive learning aligns entity‑relation pairs by semantic similarity, turning retrieval into a fine‑grained matching game.
  • Retrieval‑over‑classification boosts interpretability, letting users trace predictions back to the nearest semantic prototype.
  • For deeper dives, see “Multimodal Learning for Visual Question Answering” and the ROC GitHub repo for code and demos.
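An illustrative sketch of the retrieval-over-classification step: encode the relation labels' natural-language descriptions and pick the one most similar to the entity-pair embedding. The encoders here are hypothetical stand-ins for ROC's actual modules.

```python
# Nearest-description retrieval by cosine similarity; encode_text is an
# assumed text encoder, pair_embedding an assumed multimodal pair encoding.
import numpy as np

def retrieve_relation(pair_embedding, relation_descriptions, encode_text):
    """Return the relation whose description embedding is most similar
    (cosine similarity) to the encoded entity pair."""
    desc_embs = np.stack([encode_text(d) for d in relation_descriptions])
    # L2-normalize so that dot products are cosine similarities.
    q = pair_embedding / np.linalg.norm(pair_embedding)
    d = desc_embs / np.linalg.norm(desc_embs, axis=1, keepdims=True)
    scores = d @ q
    best = int(np.argmax(scores))
    return relation_descriptions[best], float(scores[best])
```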
Ranking
Abstract
Mainstream ranking approaches typically follow a Generator-Evaluator two-stage paradigm, where a generator produces candidate lists and an evaluator selects the best one. Recent work has attempted to enhance performance by expanding the number of candidate lists, for example, through multi-generator settings. However, ranking involves selecting a recommendation list from a combinatorially large space. Simply enlarging the candidate set remains ineffective, and performance gains quickly saturate. At the same time, recent advances in large recommendation models have shown that end-to-end one-stage models can achieve promising performance with the expectation of scaling laws. Motivated by this, we revisit ranking from a generator-only one-stage perspective. We theoretically prove that, for any (finite Multi-)Generator-Evaluator model, there always exists a generator-only model that achieves strictly smaller approximation error to the optimal ranking policy, while also enjoying scaling laws as its size increases. Building on this result, we derive an evidence upper bound of the one-stage optimization objective, from which we find that one can leverage a reward model trained on real user feedback to construct a reference policy in a group-relative manner. This reference policy serves as a practical surrogate of the optimal policy, enabling effective training of a large generator-only ranker. Based on these insights, we propose GoalRank, a generator-only ranking framework. Extensive offline experiments on public benchmarks and large-scale online A/B tests demonstrate that GoalRank consistently outperforms state-of-the-art methods.
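A hedged sketch of the group-relative construction mentioned in the abstract: score a group of candidate lists with a reward model and normalize within the group to obtain a reference distribution for training the generator-only ranker. Names and details are assumptions, not GoalRank's code.

```python
# Group-relative weighting of candidate lists via a reward model.
import numpy as np

def group_relative_weights(candidate_lists, reward_model, temperature=1.0):
    """Normalize reward-model scores within the group so each candidate list
    is weighted by how much better it is than its peers."""
    rewards = np.array([reward_model(lst) for lst in candidate_lists])
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
    # Softmax over relative advantages yields a reference distribution over
    # the group, usable as a surrogate target for the one-stage generator.
    logits = advantages / temperature
    weights = np.exp(logits - logits.max())
    return weights / weights.sum()
```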
University of Illinois
Abstract
Social media feeds have become central to the Internet. Among the most visible are trending feeds, which rank content deemed timely and relevant. To examine how feed signals influence behaviors and perceptions, we conducted a randomized experiment (n = 585) simulating Reddit's r/popular feed. By having participants view identical sets of posts in different orders, we isolate the effects of rank and social proof on engagement and perceived relevance, trustworthiness, and quality. We found that lower-ranked posts received about 40% less engagement, despite participants rarely reporting rank as a factor in their choices. In contrast, neither rank nor social proof shifted perceptions across the three dimensions. We also observed demographic patterns: older participants were more skeptical of trending content, while those with less formal education expressed greater trust. Overall, our findings show that algorithmic curation implicitly steers attention, with implications for platform design, research on algorithmic influence, and policy.
AI Insights
  • Social proof metrics (score, comments) amplify engagement independently of rank, hinting at a hidden multiplier effect on attention.
  • The experiment’s counter‑balanced feed rotation isolates rank effects, revealing a 40 % engagement drop for lower‑ranked posts even when participants claim rank is irrelevant.
  • Older users exhibit heightened skepticism toward trending content, while those with less formal education display higher trust, indicating demographic moderators of algorithmic influence.
  • Definition: Algorithmic rank is the algorithm‑determined position of a post in a feed.
  • Definition: Social proof refers to post‑level engagement metrics (score, comments) indicating popularity.
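The counterbalanced rotation mentioned above can be sketched as a simple Latin-square assignment, so that every post appears at every rank across participants; this is a generic illustration, not the study's exact scheme.

```python
# Latin-square rotation of a fixed post set across feed positions.
def rotated_orderings(posts):
    """Return len(posts) orderings; each post appears once at every rank."""
    n = len(posts)
    return [[posts[(start + offset) % n] for offset in range(n)]
            for start in range(n)]

# Example: participants are assigned one of the rotations round-robin.
feeds = rotated_orderings(["post_a", "post_b", "post_c", "post_d"])
```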