Papers from 15 to 19 September 2025

Here are the personalized paper recommendations, sorted by relevance.
Paid Search
Columbia Business School
Abstract
Search has emerged as core infrastructure for LLM-based agents and is widely viewed as critical on the path toward more general intelligence. Finance is a particularly demanding proving ground: analysts routinely conduct complex, multi-step searches over time-sensitive, domain-specific data, making it ideal for assessing both search proficiency and knowledge-grounded reasoning. Yet no existing open financial dataset evaluates the data-searching capability of end-to-end agents, largely because constructing realistic, complicated tasks requires deep financial expertise and time-sensitive data is hard to evaluate. We present FinSearchComp, the first fully open-source agent benchmark for realistic, open-domain financial search and reasoning. FinSearchComp comprises three tasks -- Time-Sensitive Data Fetching, Simple Historical Lookup, and Complex Historical Investigation -- that closely reproduce real-world financial analyst workflows. To ensure difficulty and reliability, we engage 70 professional financial experts for annotation and implement a rigorous multi-stage quality-assurance pipeline. The benchmark includes 635 questions spanning global and Greater China markets, and we evaluate 21 models (products) on it. Grok 4 (web) tops the global subset, approaching expert-level accuracy, while DouBao (web) leads on the Greater China subset. Experimental analyses show that equipping agents with web search and financial plugins substantially improves results on FinSearchComp, and that the country of origin of models and tools significantly impacts performance. By aligning with realistic analyst tasks and providing end-to-end evaluation, FinSearchComp offers a professional, high-difficulty testbed for complex financial search and reasoning.
Boston University
Abstract
Online e-commerce scams, ranging from shopping scams to pet scams, cause millions of dollars in financial damage globally every year. In response, the security community has developed highly accurate detection systems able to determine whether a website is fraudulent. However, finding candidate scam websites that can be passed as input to these downstream detection systems is challenging: relying on user reports is inherently reactive and slow, and proactive systems that issue search engine queries to return candidate websites suffer from low coverage and do not generalize to new scam types. In this paper, we present LOKI, a system designed to identify search engine queries likely to return a high fraction of fraudulent websites. LOKI implements a keyword scoring model grounded in Learning Under Privileged Information (LUPI) and feature distillation from Search Engine Result Pages (SERPs). We rigorously validate LOKI across 10 major scam categories and demonstrate a 20.58× improvement in discovery over both heuristic and data-driven baselines across all categories. Leveraging a small seed set of only 1,663 known scam sites, we use the keywords identified by our method to discover 52,493 previously unreported scams in the wild. Finally, we show that LOKI generalizes to previously unseen scam categories, highlighting its utility in surfacing emerging threats.
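To make the discovery loop concrete, here is a minimal Python sketch of SERP-driven keyword scoring: rank candidate queries by the fraction of fraudulent results they return, then harvest sites from the highest-yield queries. The `search` and `is_scam` callables are hypothetical stand-ins; LOKI's actual LUPI-based scoring model and SERP feature distillation are not shown.

```python
def score_keyword(query: str, search, is_scam, k: int = 20) -> float:
    """Fraction of the top-k search results flagged as fraudulent."""
    results = search(query)[:k]   # search(q) -> list of result URLs (assumed API)
    if not results:
        return 0.0
    return sum(is_scam(url) for url in results) / len(results)

def discover_scams(candidates, search, is_scam, budget: int = 100) -> set:
    """Rank candidate queries by estimated yield, then harvest scam sites
    from the best ones (outer loop only; a sketch, not LOKI's model)."""
    ranked = sorted(candidates, key=lambda q: score_keyword(q, search, is_scam),
                    reverse=True)
    found = set()
    for query in ranked[:budget]:
        found.update(url for url in search(query) if is_scam(url))
    return found
```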
Bidding
Abstract
Auto-bidding is an essential tool for advertisers to enhance their advertising performance. Recent progress has shown that AI-Generated Bidding (AIGB), which formulates auto-bidding as a trajectory generation task and trains a conditional diffusion-based planner on offline data, achieves superior and stable performance compared to typical offline reinforcement learning (RL)-based auto-bidding methods. However, existing AIGB methods still encounter a performance bottleneck due to their neglect of fine-grained generation quality evaluation and inability to explore beyond static datasets. To address this, we propose AIGB-Pearl (Planning with EvAluator via RL), a novel method that integrates generative planning and policy optimization. The key to AIGB-Pearl is to construct a non-bootstrapped trajectory evaluator to assign rewards and guide policy search, enabling the planner to optimize its generation quality iteratively through interaction. Furthermore, to enhance trajectory evaluator accuracy in offline settings, we incorporate three key techniques: (i) a Large Language Model (LLM)-based architecture for better representational capacity, (ii) hybrid point-wise and pair-wise losses for better score learning, and (iii) adaptive integration of expert feedback for better generalization ability. Extensive experiments on both simulated and real-world advertising systems demonstrate the state-of-the-art performance of our approach.
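For intuition about the "hybrid point-wise and pair-wise losses" in (ii), here is a minimal PyTorch sketch of an evaluator loss that combines reward regression with a ranking term; the logistic pair-wise form and the weighting `alpha` are assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def evaluator_loss(pred_scores: torch.Tensor, true_rewards: torch.Tensor,
                   alpha: float = 0.5) -> torch.Tensor:
    """Hybrid score-learning loss for a trajectory evaluator (illustrative sketch).

    pred_scores:  (B,) evaluator scores for a batch of trajectories
    true_rewards: (B,) offline reward labels for the same trajectories
    """
    # Point-wise term: regress evaluator scores onto observed rewards.
    pointwise = F.mse_loss(pred_scores, true_rewards)

    # Pair-wise term: for every ordered pair, the higher-reward trajectory
    # should receive the higher score (RankNet-style logistic loss).
    diff_pred = pred_scores.unsqueeze(0) - pred_scores.unsqueeze(1)   # (B, B)
    diff_true = true_rewards.unsqueeze(0) - true_rewards.unsqueeze(1)
    mask = (diff_true > 0).float()                 # pairs with a clear ordering
    pairwise = (F.softplus(-diff_pred) * mask).sum() / mask.sum().clamp(min=1)

    return alpha * pointwise + (1 - alpha) * pairwise
```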
Abstract
Auto-bidding systems are widely used in advertising to automatically determine bid values under constraints such as total budget and Return-on-Spend (RoS) targets. Existing works often assume that the value of an ad impression, such as the conversion rate, is known. This paper considers the more realistic scenario where the true value is unknown. We propose a novel method that uses conformal prediction to quantify the uncertainty of these values based on machine learning methods trained on historical bidding data with contextual features, without assuming the data are i.i.d. This approach is compatible with current industry systems that use machine learning to predict values. Building on prediction intervals, we introduce an adjusted value estimator derived from machine learning predictions, and show that it provides performance guarantees without requiring knowledge of the true value. We apply this method to enhance existing auto-bidding algorithms with budget and RoS constraints, and establish theoretical guarantees for achieving high reward while keeping RoS violations low. Empirical results on both simulated and real-world industrial datasets demonstrate that our approach improves performance while maintaining computational efficiency.
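A minimal split-conformal sketch of the idea: calibrate an interval around the predicted conversion value, then feed its lower endpoint to the bidder as a conservative "adjusted value". The i.i.d. simplification and the lower-endpoint rule here are assumptions; the paper works without the i.i.d. assumption and with contextual features.

```python
import numpy as np

def conformal_interval(pred_cal, y_cal, pred_new, alpha=0.1):
    """Split conformal prediction interval with ~(1 - alpha) marginal coverage.

    pred_cal, y_cal: model predictions and realized values on a calibration set.
    pred_new:        model prediction for a new ad impression.
    """
    scores = np.abs(y_cal - pred_cal)                      # nonconformity scores
    n = len(scores)
    level = min(1.0, np.ceil((n + 1) * (1 - alpha)) / n)   # finite-sample correction
    q = np.quantile(scores, level)
    return pred_new - q, pred_new + q

rng = np.random.default_rng(0)
y_cal = rng.uniform(0.01, 0.08, 500)              # synthetic conversion values
pred_cal = y_cal + rng.normal(0, 0.005, 500)      # noisy model predictions of them
lower, upper = conformal_interval(pred_cal, y_cal, pred_new=0.045)
adjusted_value = max(lower, 0.0)  # bidding on the lower endpoint guards the RoS target
```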
Personalization
University of Siegen, Go
Abstract
As global warming soars, the need to assess and reduce the environmental impact of recommender systems is becoming increasingly urgent. Despite this, the recommender systems community hardly understands, addresses, and evaluates the environmental impact of their work. In this study, we examine the environmental impact of recommender systems research by reproducing typical experimental pipelines. Based on our results, we provide guidelines for researchers and practitioners on how to minimize the environmental footprint of their work and implement green recommender systems - recommender systems designed to minimize their energy consumption and carbon footprint. Our analysis covers 79 papers from the 2013 and 2023 ACM RecSys conferences, comparing traditional "good old-fashioned AI" models with modern deep learning models. We designed and reproduced representative experimental pipelines for both years, measuring energy consumption using a hardware energy meter and converting it into CO2 equivalents. Our results show that papers utilizing deep learning models emit approximately 42 times more CO2 equivalents than papers using traditional models. On average, a single deep learning-based paper generates 2,909 kilograms of CO2 equivalents - more than the carbon emissions of a person flying from New York City to Melbourne or the amount of CO2 sequestered by one tree over 260 years. This work underscores the urgent need for the recommender systems and wider machine learning communities to adopt green AI principles, balancing algorithmic advancements and environmental responsibility to build a sustainable future with AI-powered personalization.
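The kWh-to-CO2e conversion behind such figures is straightforward: metered energy multiplied by the carbon intensity of the electricity grid. The numbers below are illustrative assumptions chosen to reproduce the reported order of magnitude, not the paper's measured values.

```python
# Convert metered energy consumption to CO2 equivalents (illustrative sketch).
energy_kwh = 7_270.0   # hypothetical total metered energy for one paper's experiments
grid_intensity = 0.4   # kg CO2e per kWh; varies widely by country and grid (assumed)
co2e_kg = energy_kwh * grid_intensity
print(f"{co2e_kg:,.0f} kg CO2e")   # ~2,908 kg, the order of magnitude reported
```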
AI Insights
  • The authors provide a reproducible pipeline that measures real hardware energy use, not just theoretical FLOPs.
  • A detailed checklist urges authors to disclose energy budgets, CO₂ equivalents, and hardware specs for each experiment.
  • Comparative tables show deep‑learning recommenders emit 42× more CO₂ than classic matrix‑factorization models.
  • The paper argues that environmental costs should be justified by links to tangible societal benefits of the research.
  • It recommends low‑power hardware and algorithmic pruning to shrink the carbon footprint of future systems.
  • By framing sustainability as a research metric, the study invites curiosity about how green AI can coexist with high recommendation accuracy.
Tsinghua University, Tsia
Abstract
Standardized, one-size-fits-all educational content often fails to connect with students' individual backgrounds and interests, leading to disengagement and a perceived lack of relevance. To address this challenge, we introduce PAGE, a novel framework that leverages large language models (LLMs) to automatically personalize educational materials by adapting them to each student's unique context, such as their major and personal interests. To validate our approach, we deployed PAGE in a semester-long intelligent tutoring system and conducted a user study to evaluate its impact in an authentic educational setting. Our findings show that students who received personalized content demonstrated significantly improved learning outcomes and reported higher levels of engagement, perceived relevance, and trust compared to those who used standardized materials. This work demonstrates the practical value of LLM-powered personalization and offers key design implications for creating more effective, engaging, and trustworthy educational experiences.
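As a concrete illustration of this kind of personalization, here is a hypothetical prompt template that adapts a passage to a student's major and interests; the field names and wording are assumptions, not PAGE's actual prompts (the AI Insights below describe what the paper's prompts target).

```python
# Hypothetical structured prompt for LLM-based content personalization.
PROMPT = """You are adapting educational content for one student.
Student major: {major}
Student interests: {interests}

Rewrite the passage below so its examples and framing connect to this
student's background, without changing the underlying concepts:

{passage}
"""

def build_prompt(major: str, interests: list, passage: str) -> str:
    return PROMPT.format(major=major, interests=", ".join(interests), passage=passage)

print(build_prompt("Biology", ["genomics", "hiking"],
                   "Gradient descent iteratively minimizes a loss function..."))
```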
AI Insights
  • Structured prompts steer LLMs to extract user profiles, generate personalized search queries, and rank lecture scripts by Instructional Accuracy, Clarity, and Logical Coherence.
  • Six learning‑effectiveness dimensions—Learning new Concepts, Deepening, Attractiveness, Efficiency, Stimulation, Dependability—were scored 1‑5 for 40 students, linking Personalization Relevance to higher engagement.
  • Table 10 shows higher Personalization Relevance consistently boosts Student Engagement and Trust.
  • Prompt design with explicit output requirements and example templates markedly improves content accuracy and relevance.
  • Recommended reading: “Natural Language Processing with Python,” “Deep Learning for Natural Language Processing,” and the paper “Learning in Context: A Framework for Personalized Education using Large Language Models.”
Direction on Data Science Organizations
Rice University, Virginia
Abstract
Rapid computational developments - particularly the proliferation of artificial intelligence (AI) - increasingly shape social scientific research while raising new questions about in-depth qualitative methods such as ethnography and interviewing. Building on classic debates about using computers to analyze qualitative data, we revisit longstanding concerns and assess possibilities and dangers in an era of automation, AI chatbots, and 'big data.' We first historicize developments by revisiting classical and emergent concerns about qualitative analysis with computers. We then introduce a typology of contemporary modes of engagement - streamlining workflows, scaling up projects, hybrid analytical approaches, and the sociology of computation - alongside rejection of computational analyses. We illustrate these approaches with detailed workflow examples from a large-scale ethnographic study and guidance for solo researchers. We argue for a pragmatic sociological approach that moves beyond dualisms of technological optimism versus rejection to show how computational tools - simultaneously dangerous and generative - can be adapted to support longstanding qualitative aims when used carefully in ways aligned with core methodological commitments.
AI Insights
  • The study maps four AI engagement modes—workflow streamlining, scaling, hybrid analysis, and the sociology of computation—beyond optimism–rejection.
  • A large‑scale ethnographic workflow example shows AI accelerating coding while preserving nuance.
  • Hybrid coding blends human insight with LLM prompts, cutting effort yet boosting reliability.
  • Warnings note that LLMs substituting participants can flatten identity groups, raising ethical stakes.
  • The authors call for empirical tests of AI’s effectiveness, comparing it against the rigor of traditional coding.
  • Solo researchers receive step‑by‑step guidance on integrating AI without compromising methodological integrity.
Data Science Management
Brookhaven National Lab
Abstract
The collaborative efforts of large communities in science experiments, often comprising thousands of global members, reflect a monumental commitment to exploration and discovery. Recently, advanced and complex data processing has gained increasing importance in science experiments. Data processing workflows typically consist of multiple intricate steps, and the precise specification of resource requirements is crucial for each step to allocate optimal resources for effective processing. Estimating resource requirements in advance is challenging due to a wide range of analysis scenarios, varying skill levels among community members, and the continuously increasing spectrum of computing options. One practical approach to mitigate these challenges involves initially processing a subset of each step to measure precise resource utilization from actual processing profiles before completing the entire step. While this two-staged approach enables processing on optimal resources for most of the workflow, it has drawbacks such as initial inaccuracies leading to potential failures and suboptimal resource usage, along with overhead from waiting for initial processing completion, which is critical for fast-turnaround analyses. In this context, our study introduces a novel pipeline of machine learning models within a comprehensive workflow management system, the Production and Distributed Analysis (PanDA) system. These models employ advanced machine learning techniques to predict key resource requirements, overcoming challenges posed by limited upfront knowledge of characteristics at each step. Accurate forecasts of resource requirements enable informed and proactive decision-making in workflow management, enhancing the efficiency of handling diverse, complex workflows across heterogeneous resources.
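As a rough illustration of the prediction step, here is a sketch that regresses a resource target (peak memory) on task metadata; the features, the gradient-boosting model, and the synthetic data are assumptions, not PanDA's actual pipeline.

```python
# Sketch: predict per-task peak memory (GB) from task metadata.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
# Hypothetical features: [input_size_gb, n_events_thousands, software_version_id]
X = rng.uniform([10, 10, 0], [500, 5000, 8], size=(1000, 3))
y = 0.5 * X[:, 0] + 0.002 * X[:, 1] + rng.normal(0, 2, 1000)   # synthetic target

model = GradientBoostingRegressor().fit(X[:800], y[:800])
pred = model.predict(X[800:])
# Same style of metric the insights below cite: fraction of tasks whose
# predicted usage falls within 5% of the actual value.
rel_err = np.abs(pred - y[800:]) / np.maximum(np.abs(y[800:]), 1e-9)
print(f"{np.mean(rel_err <= 0.05):.0%} of held-out tasks predicted within 5% of actual")
```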
AI Insights
  • PanDA now runs a full ML pipeline that predicts memory, CPU, I/O, and walltime with sub‑second latency.
  • 70 % of tasks are predicted within 5 % of actual usage, cutting idle time dramatically.
  • Future work includes clustering task attributes, adding domain knowledge, and a feedback loop for continuous model refinement.
  • Transfer learning across diverse scientific workflows is proposed to generalize the models beyond the current dataset.
  • The authors cite “pipecomp, a General Framework for the Evaluation of Computational Pipelines” and recommend “Robust Performance Metrics for Imbalanced Classification Problems” for deeper evaluation.
arXiv:2509.13436v1 [cs.SE]
Abstract
As research increasingly relies on computational methods, the reliability of scientific results depends on the quality, reproducibility, and transparency of research software. Ensuring these qualities is critical for scientific integrity and discovery. This paper asks whether Research Software Science (RSS)--the empirical study of how research software is developed and used--should be considered a form of metascience, the science of science. Classification matters because it could affect recognition, funding, and integration of RSS into research improvement. We define metascience and RSS, compare their principles and objectives, and examine their overlaps. Arguments for classification highlight shared commitments to reproducibility, transparency, and empirical study of research processes. Arguments against portray RSS as a specialized domain focused on a tool rather than the broader scientific enterprise. Our analysis finds RSS advances core goals of metascience, especially in computational reproducibility, and bridges technical, social, and cognitive aspects of research. Its classification depends on whether one adopts a broad definition of metascience--any empirical effort to improve science--or a narrow one focused on systemic and epistemological structures. We argue RSS is best understood as a distinct interdisciplinary domain that aligns with, and in some definitions fits within, metascience. Recognizing it as such can strengthen its role in improving reliability, justify funding, and elevate software development in research institutions. Regardless of classification, applying scientific rigor to research software ensures the tools of discovery meet the standards of the discoveries themselves.
AI Insights
  • RSS adopts empirical methods akin to Empirical Software Engineering to quantify software quality metrics.
  • The field’s core contribution is a reproducibility framework that maps software artifacts to experimental protocols.
  • Literature such as Bennett’s An Introduction to Metascience and Mausfeld’s Epsilon‑Metascience contextualizes RSS within broader meta‑research debates.
  • Ziemann et al.’s Five Pillars of Computational Reproducibility provides a practical checklist that RSS researchers routinely apply.
  • The FORRT framework offers a training curriculum that integrates open‑source practices with rigorous reproducibility standards.
  • RSS is positioned as an interdisciplinary bridge, linking cognitive science, sociology of science, and software engineering.
  • Recognizing RSS as a distinct domain can unlock targeted funding streams and institutional support for research software development.
Attribution
University of Illinois at
Abstract
Training data attribution (TDA) plays a critical role in understanding the influence of individual training data points on model predictions. Gradient-based TDA methods, popularized by influence functions for their superior performance, have been widely applied in data selection, data cleaning, data economics, and fact tracing. However, in real-world scenarios where commercial models are not publicly accessible and computational resources are limited, existing TDA methods are often constrained by their reliance on full model access and high computational costs. This poses significant challenges to the broader adoption of TDA in practical applications. In this work, we present a systematic study of TDA methods under various access and resource constraints. We investigate the feasibility of performing TDA under varying levels of access constraints by leveraging appropriately designed solutions such as proxy models. In addition, we demonstrate that attribution scores obtained from models without prior training on the target dataset remain informative across a range of tasks, which is useful for scenarios where computational resources are limited. Our findings provide practical guidance for deploying TDA in real-world environments, aiming to improve feasibility and efficiency under limited access.
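For readers new to gradient-based TDA, here is a first-order sketch that scores each training point by the dot product between its loss gradient and the test example's loss gradient (a TracIn-style heuristic; full influence functions additionally apply an inverse-Hessian product, and the paper's proxy-model setup is not shown).

```python
import torch

def attribution_scores(model, loss_fn, train_batch, test_example):
    """First-order TDA sketch: gradient-similarity scores for training points.

    train_batch:  iterable of (x, y) training examples
    test_example: a single (x, y) test pair
    """
    params = [p for p in model.parameters() if p.requires_grad]

    def flat_grad(x, y):
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        return torch.cat([g.flatten() for g in grads])

    g_test = flat_grad(*test_example)
    # Positive score: the training point's gradient aligns with the test
    # gradient, i.e. training on it would tend to reduce the test loss.
    return [torch.dot(flat_grad(x, y), g_test).item() for x, y in train_batch]
```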
Abstract
Feature attribution is essential for interpreting deep learning models, particularly in time-series domains such as healthcare, biometrics, and human-AI interaction. However, standard attribution methods, such as Integrated Gradients or SHAP, are computationally intensive and not well-suited for real-time applications. We present DeepACTIF, a lightweight and architecture-aware feature attribution method that leverages internal activations of sequence models to estimate feature importance efficiently. Focusing on LSTM-based networks, we introduce an inverse-weighted aggregation scheme that emphasises stability and magnitude of activations across time steps. Our evaluation across three biometric gaze datasets shows that DeepACTIF not only preserves predictive performance under severe feature reduction (top 10% of features) but also significantly outperforms established methods, including SHAP, IG, and DeepLIFT, in terms of both accuracy and statistical robustness. Using Wilcoxon signed-rank tests and effect size analysis, we demonstrate that DeepACTIF yields more informative feature rankings with significantly lower error across all top-k conditions (10 - 40%). Our experiments demonstrate that DeepACTIF not only reduces computation time and memory usage by orders of magnitude but also preserves model accuracy when using only top-ranked features. That makes DeepACTIF a viable solution for real-time interpretability on edge devices such as mobile XR headsets or embedded health monitors.
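The abstract does not spell out the inverse-weighted aggregation, so the scheme below (mean activation magnitude weighted inversely by temporal variability, to favor large and stable activations) is an assumption that illustrates the general shape of the method.

```python
import numpy as np

def activation_importance(activations: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Per-feature importance from a sequence model's hidden activations.

    activations: (T, H) array of hidden states over T time steps.
    Returns an (H,) score favoring features with large, stable activations
    (assumed weighting; DeepACTIF's exact scheme may differ).
    """
    magnitude = np.abs(activations).mean(axis=0)    # mean |activation| per feature
    variability = activations.std(axis=0)           # temporal standard deviation
    return magnitude / (variability + eps)          # stable & large -> high score

scores = activation_importance(np.random.randn(100, 64))
top_k = np.argsort(scores)[-int(0.1 * len(scores)):]   # keep the top 10% of features
```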

Interests not found

We did not find any papers matching the interests below. Try other search terms, and consider whether such content exists on arxiv.org.
  • customer relationship management (CRM) optimization
  • Marketing Channels
You can edit or add more interests any time.
