Hi!

Your personalized paper recommendations for 01 to 05 December, 2025.
AI for Product Management
Mercor
Abstract
We introduce the first version of the AI Consumer Index (ACE), a benchmark for assessing whether frontier AI models can perform high-value consumer tasks. ACE contains a hidden held-out set of 400 test cases, split across four consumer activities: shopping, food, gaming, and DIY. We are also open-sourcing 80 cases as a dev set under a CC-BY license. For the ACE leaderboard we evaluated 10 frontier models (with web search turned on) using a novel grading methodology that dynamically checks whether relevant parts of the response are grounded in the retrieved web sources. GPT 5 (Thinking = High) is the top-performing model, scoring 56.1%, followed by o3 Pro (Thinking = On) (55.2%) and GPT 5.1 (Thinking = High) (55.1%). Performance differs across domains, and in Shopping the top model scores under 50%. For some requests (such as giving the correct price or providing working links), models are highly prone to hallucination. Overall, ACE shows a substantial gap between the performance of even the best models and consumers' AI needs.
AI Summary
  • The ACE benchmark is a comprehensive evaluation of conversational AI models, covering various domains such as DIY, food, gaming, and shopping. [3]
  • The benchmark consists of multiple workflows for each domain, with specific instructions and criteria for the models to follow. [3]
  • Gemini 2.5 Flash (On), Gemini 3 Pro (High), and o3 (On) also demonstrate strong performance across various domains. [3]
  • The benchmark highlights the strengths and weaknesses of each model, providing valuable insights for developers and researchers to improve their conversational AI systems. [3]
  • ACE-v1-heldout: A subset of the ACE benchmark used for evaluation, consisting of 100 cases per domain. [3]
  • Bootstrapped confidence intervals: A statistical method used to estimate the uncertainty of mean scores by resampling with replacement from the original dataset. [3]
  • Domain: A specific category or area of expertise within the ACE benchmark, such as DIY, food, gaming, or shopping. [3]
  • Model: A conversational AI system being evaluated on the ACE benchmark, including models like Gemini 2.5 Flash (On), GPT-5 (High), and o3 (On). [3]
  • The results show that GPT-5 (High) and GPT-5.1 (High) perform exceptionally well, achieving the highest mean scores in most domains. [2]
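The bootstrapped confidence intervals mentioned in the summary can be illustrated with a minimal percentile-bootstrap sketch. This is not the leaderboard's actual evaluation code; the function name and parameters are assumptions for illustration.

```python
import random
import statistics

def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean: resample the scores
    with replacement, recompute the mean of each resample, and
    take the alpha/2 and 1-alpha/2 quantiles of those means."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

For a benchmark like ACE, `scores` would be the per-case grades for one model, and the interval reflects how much the mean score could vary under resampling of the 400 held-out cases.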
Zhejiang University
Abstract
LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.
AI Summary
  • The proposed benchmark may not capture all aspects of human intelligence, such as common sense. [3]
  • Large language models are systems that can understand and generate human-like text. [3]
  • The authors emphasize that solving problems in mathematics and science is an important component of intelligent behavior. [3]
  • The paper discusses the challenges of evaluating large language models (LLMs) and proposes a new benchmark for measuring their performance. [2]
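The abstract describes InnoGym's two complementary metrics: performance gain over the best-known solution, and novelty relative to prior approaches. The sketch below is one plausible formalization; the exact formulas, names, and use of method embeddings are assumptions, not the paper's definitions.

```python
import math

def performance_gain(agent_score, best_known_score):
    # Relative improvement over the best-known solution;
    # positive only when the agent beats the prior best.
    return (agent_score - best_known_score) / abs(best_known_score)

def novelty(agent_method_embedding, prior_method_embeddings):
    # Novelty as one minus the maximum cosine similarity to any
    # prior approach: 1.0 means unlike everything seen before.
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb)
    return 1.0 - max(cos(agent_method_embedding, e)
                     for e in prior_method_embeddings)
```

Separating the two metrics makes the gap the paper highlights concrete: an agent can score high on novelty (its method embedding is far from all prior ones) while showing little or even negative performance gain.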
Vision Setting for Tech Teams
Guilin University of Tech
Abstract
Active visual perception refers to the ability of a system to dynamically engage with its environment through sensing and action, allowing it to modify its behavior in response to specific goals or uncertainties. Unlike passive systems that rely solely on visual data, active visual perception systems can direct attention, move sensors, or interact with objects to acquire more informative data. This approach is particularly powerful in complex environments where static sensing methods may not provide sufficient information. Active visual perception plays a critical role in numerous applications, including robotics, autonomous vehicles, human-computer interaction, and surveillance systems. However, despite its significant promise, there are several challenges that need to be addressed, including real-time processing of complex visual data, decision-making in dynamic environments, and integrating multimodal sensory inputs. This paper explores both the opportunities and challenges inherent in active visual perception, providing a comprehensive overview of its potential, current research, and the obstacles that must be overcome for broader adoption.
AI Summary
  • Multi-agent cloud robotics: A type of robotics where multiple agents work together in a cloud computing environment to achieve tasks. [3]
  • Visual perception technology has been applied in various fields such as smart farming, agriculture intelligence, climate change detection, and autonomous vehicles. [2]
University of Electronic
Abstract
Fine-tuning large pretrained vision-language models (VLMs) has emerged as a prevalent paradigm for downstream adaptation, yet it faces a critical trade-off between domain specificity and domain generalization (DG) ability. Current methods typically fine-tune a universal model on the entire dataset, which potentially compromises the ability to generalize to unseen domains. To fill this gap, we provide a theoretical understanding of the generalization ability for VLM fine-tuning, which reveals that training multiple parameter-efficient expert models on partitioned source domains leads to better generalization than fine-tuning a universal model. Inspired by this finding, we propose a two-step domain-expert-Guided DG (GuiDG) framework. GuiDG first employs prompt tuning to obtain source domain experts, then introduces a Cross-Modal Attention module to guide the fine-tuning of the vision encoder via adaptive expert integration. To better evaluate few-shot DG, we construct ImageNet-DG from ImageNet and its variants. Extensive experiments on standard DG benchmarks and ImageNet-DG demonstrate that GuiDG improves upon state-of-the-art fine-tuning methods while maintaining efficiency.
Product Strategy
Eindhoven University of
Abstract
As a key circular economy strategy, remanufacturing allows original equipment manufacturers (OEMs) to reduce waste by restoring used products to "as-new" condition. This paper investigates an OEM's optimal remanufacturing business model by incorporating consumer perceptions into price and production quantity decisions. We analyze three alternative models: no remanufacturing, OEM in-house remanufacturing, and third-party remanufacturer (TPR) authorized remanufacturing. We extend the authorization with a two-part tariff contract and consider a stochastic market size. Through a numerical approach, we optimize price and quantity decisions based on consumer perceptions and develop a hierarchical decision roadmap to guide model selection. Our findings show that when consumers' perceived value of remanufactured products is high, OEM in-house remanufacturing is most profitable and reduces environmental impacts, but generally leads to a market dominated by remanufactured products. In contrast, when consumers' perceived value of remanufactured products is moderate and TPR remanufacturing significantly increases the perceived value of new products, TPR-authorized remanufacturing is most profitable. It typically boosts total market sales, but accordingly increases environmental impacts. In addition, sensitivity analysis indicates that two-part authorization contracts are more effective than one-part contracts at meeting stringent environmental requirements. Incorporating market size stochasticity enhances system profitability while keeping environmental impacts within a limited scope.
AI Summary
  • Consumer perceptions play a crucial role in determining the OEM's selection of the most profitable remanufacturing business models. [3]
  • The numerical results provide managerial insights for model selection, with the OEM first evaluating consumer attitudes towards remanufactured products in their target market. [3]
  • Optimal remanufacturing models generally expand total market sales by attracting new customer segments and can even increase sales of new products through the contrast effect. [3]
  • Degree of assimilation and contrast effects (|β|): The degree to which consumers perceive remanufactured products as similar or dissimilar to new products. [3]
  • The WTP discount factor for remanufactured products (α) and the degree of assimilation and contrast effects (|β|) are key factors to consider when evaluating consumer perceptions. [2]
  • TPR-authorized remanufacturing (Model T): A business model where a third-party remanufacturer (TPR) is authorized by the OEM to engage in remanufacturing activities. [1]
ITMO University
Abstract
The future of digital marketing lies in the convergence of human creativity and generative AI, where insight, strategy, and storytelling are co-authored by intelligent systems. We present MindFuse, a brave new explainable generative AI framework designed to act as a strategic partner in the marketing process. Unlike conventional LLM applications that stop at content generation, MindFuse fuses CTR-based, AI-guided content co-creation with large language models to extract, interpret, and iterate on communication narratives grounded in real advertising data. MindFuse operates across the full marketing lifecycle: from distilling content pillars and customer personas from competitor campaigns to recommending in-flight optimizations based on live performance telemetry. It uses attention-based explainability to diagnose ad effectiveness and guide content iteration, while aligning messaging with strategic goals through dynamic narrative construction and storytelling. We introduce a new paradigm in GenAI for marketing, where LLMs not only generate content but reason through it, adapt campaigns in real time, and learn from audience engagement patterns. Our results, validated in agency deployments, demonstrate up to 12-fold efficiency gains, setting the stage for future integration with empirical audience data (e.g., GWI, Nielsen) and full-funnel attribution modeling. MindFuse redefines AI not just as a tool, but as a collaborative agent in the creative and strategic fabric of modern marketing.
AI Summary
  • The importance of AI in advertising and marketing. [3]
  • The role of large language models in transforming industries. [3]
  • The need for human creativity in storytelling. [3]
  • AI co-creation: the process of using AI to generate creative ideas or content, often in collaboration with humans. [3]
  • AI is becoming increasingly important in advertising and marketing, but it still requires human input for effective storytelling. [2]

Interests not found

We did not find any papers that match the interests below. Try other terms, and also consider whether the content exists on arxiv.org.
  • Product Management
  • Product Roadmap
You can edit or add more interests any time.