Papers from September 29 to October 3, 2025

Here are your personalized paper recommendations, sorted by relevance.
LLMs for Compliance
City University of Hong Kong
Abstract
Large Language Models are profoundly changing work patterns in high-risk professional domains, yet their application also introduces severe and underexplored compliance risks. To investigate this issue, we conducted semi-structured interviews with 24 highly-skilled knowledge workers from industries such as law, healthcare, and finance. The study found that these experts are commonly concerned about sensitive information leakage, intellectual property infringement, and uncertainty regarding the quality of model outputs. In response, they spontaneously adopt various mitigation strategies, such as actively distorting input data and limiting the details in their prompts. However, the effectiveness of these spontaneous efforts is limited due to a lack of specific compliance guidance and training for Large Language Models. Our research reveals a significant gap between current NLP tools and the actual compliance needs of experts. This paper positions these valuable empirical findings as foundational work for building the next generation of Human-Centered, Compliance-Driven Natural Language Processing for Regulatory Technology (RegTech), providing a critical human-centered perspective and design requirements for engineering NLP systems that can proactively support expert compliance workflows.
Shanghai Artificial Intelligence Laboratory
Abstract
The rapid integration of Large Language Models (LLMs) into high-stakes domains necessitates reliable safety and compliance evaluation. However, existing static benchmarks are ill-equipped to address the dynamic nature of AI risks and evolving regulations, creating a critical safety gap. This paper introduces a new paradigm of agentic safety evaluation, reframing evaluation as a continuous and self-evolving process rather than a one-time audit. We then propose a novel multi-agent framework SafeEvalAgent, which autonomously ingests unstructured policy documents to generate and perpetually evolve a comprehensive safety benchmark. SafeEvalAgent leverages a synergistic pipeline of specialized agents and incorporates a Self-evolving Evaluation loop, where the system learns from evaluation results to craft progressively more sophisticated and targeted test cases. Our experiments demonstrate the effectiveness of SafeEvalAgent, showing a consistent decline in model safety as the evaluation hardens. For instance, GPT-5's safety rate on the EU AI Act drops from 72.50% to 36.36% over successive iterations. These findings reveal the limitations of static assessments and highlight our framework's ability to uncover deep vulnerabilities missed by traditional methods, underscoring the urgent need for dynamic evaluation ecosystems to ensure the safe and responsible deployment of advanced AI.
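A minimal sketch of what such a self-evolving evaluation loop could look like, assuming a generator agent that derives harder test cases from previous failures and a judge that checks responses against policy clauses; all function, class, and parameter names below are illustrative placeholders, not the paper's SafeEvalAgent implementation:

```python
# Illustrative sketch of a self-evolving safety-evaluation loop in the spirit of
# SafeEvalAgent. Placeholder functions raise NotImplementedError; a real system
# would back them with LLM agents.
from dataclasses import dataclass, field

@dataclass
class TestCase:
    prompt: str          # adversarial or policy-derived query
    policy_clause: str   # clause of the regulation the case targets

@dataclass
class EvalRound:
    cases: list[TestCase]
    failures: list[TestCase] = field(default_factory=list)

def generate_test_cases(policy_text: str, seeds: list[TestCase]) -> list[TestCase]:
    """Placeholder: derive new, harder cases from the policy document and from
    previously failed cases (the 'self-evolving' step)."""
    raise NotImplementedError

def is_safe(model_response: str, clause: str) -> bool:
    """Placeholder: a judge agent would check the response against the clause."""
    raise NotImplementedError

def run_evaluation(model, policy_text: str, iterations: int = 3) -> list[float]:
    """Evaluate `model` (a callable: prompt -> response) over several rounds,
    hardening each round with the failures observed in the previous one."""
    safety_rates, seeds = [], []
    for _ in range(iterations):
        round_ = EvalRound(cases=generate_test_cases(policy_text, seeds))
        for case in round_.cases:
            if not is_safe(model(case.prompt), case.policy_clause):
                round_.failures.append(case)
        safety_rates.append(1 - len(round_.failures) / len(round_.cases))
        seeds = round_.failures  # feed failures back into the next round
    return safety_rates
```

The declining per-round safety rate this loop reports corresponds to the hardening effect the abstract describes.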
AI Governance
Abstract
This paper examines the assessment challenges of Responsible AI (RAI) governance efforts in globally decentralized organizations through a case study collaboration between a leading research university and a multinational enterprise. While there are many proposed frameworks for RAI, their application in complex organizational settings with distributed decision-making authority remains underexplored. Our RAI assessment, conducted across multiple business units and AI use cases, reveals four key patterns that shape RAI implementation: (1) complex interplay between group-level guidance and local interpretation, (2) challenges translating abstract principles into operational practices, (3) regional and functional variation in implementation approaches, and (4) inconsistent accountability in risk oversight. Based on these findings, we propose an Adaptive RAI Governance (ARGO) Framework that balances central coordination with local autonomy through three interdependent layers: shared foundation standards, central advisory resources, and contextual local implementation. We contribute insights from academic-industry collaboration for RAI assessments, highlighting the importance of modular governance approaches that accommodate organizational complexity while maintaining alignment with responsible AI principles. These lessons offer practical guidance for organizations navigating the transition from RAI principles to operational practice within decentralized structures.
St. Pölten University of Applied Sciences
Abstract
Artificial Intelligence has rapidly become a cornerstone technology, significantly influencing Europe's societal and economic landscapes. However, the proliferation of AI also raises critical ethical, legal, and regulatory challenges. The CERTAIN (Certification for Ethical and Regulatory Transparency in Artificial Intelligence) project addresses these issues by developing a comprehensive framework that integrates regulatory compliance, ethical standards, and transparency into AI systems. In this position paper, we outline the methodological steps for building the core components of this framework. Specifically, we present: (i) semantic Machine Learning Operations (MLOps) for structured AI lifecycle management, (ii) ontology-driven data lineage tracking to ensure traceability and accountability, and (iii) regulatory operations (RegOps) workflows to operationalize compliance requirements. By implementing and validating its solutions across diverse pilots, CERTAIN aims to advance regulatory compliance and to promote responsible AI innovation aligned with European standards.
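As a rough illustration of the ontology-driven data lineage idea, here is a sketch assuming a simple append-only record per lifecycle step; the class, field names, and the PROV-O term used in the example are hypothetical choices, not CERTAIN's actual schema:

```python
# Illustrative lineage record for one AI lifecycle step. A real deployment
# would persist these entries in a triple store keyed to a domain ontology.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass(frozen=True)
class LineageRecord:
    dataset_id: str      # identifier of the input dataset
    activity: str        # lifecycle step, e.g. "training" or "evaluation"
    agent: str           # responsible party, for accountability
    ontology_term: str   # IRI of the concept describing this activity
    timestamp: str       # UTC timestamp of the step

def record_step(dataset_id: str, activity: str, agent: str, ontology_term: str) -> LineageRecord:
    """Create an append-only lineage entry for traceability."""
    return LineageRecord(dataset_id, activity, agent, ontology_term,
                         datetime.now(timezone.utc).isoformat())

# Example: tag a (hypothetical) training run with a generic PROV-O activity term.
entry = record_step("ds-credit-2025", "training", "model-team@example.org",
                    "http://www.w3.org/ns/prov#Activity")
```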
Chat Designers
CyberAgent, The University of
Abstract
The rapid development of artificial intelligence (AI) has fundamentally transformed creative work practices in the design industry. Existing studies have identified both opportunities and challenges for creative practitioners in their collaboration with generative AI and explored ways to facilitate effective human-AI co-creation. However, there is still a limited understanding of designers' collaboration with AI that supports creative processes distinct from generative AI. To address these gaps, this study focuses on understanding designers' collaboration with decision-making AI, which supports the convergence process in the creative workflow, as opposed to the divergent process supported by generative AI. Specifically, we conducted a case study at an online advertising design company to explore how professional graphic designers at the company perceive the impact of decision-making AI on their creative work practices. The case company incorporated an AI system that predicts the effectiveness of advertising design into the design workflow as a decision-making support tool. Findings from interviews with 12 designers identified how designers trust and rely on AI, its perceived benefits and challenges, and their strategies for navigating the challenges. Based on the findings, we discuss design recommendations for integrating decision-making AI into the creative design workflow.
Abstract
Selecting a college major is a difficult decision for many incoming freshmen. Traditional academic advising is often hindered by long wait times, intimidating environments, and limited personalization. AI chatbots present an opportunity to address these challenges. However, AI systems also have the potential to generate biased responses that reflect prejudices related to race, gender, socioeconomic status, and disability. These biases risk turning away potential students and undermining the reliability of AI systems. This study develops an AI chatbot specific to the University of Maryland (UMD) A. James Clark School of Engineering program. Our research team analyzed and mitigated potential biases in the responses. We tested the chatbot on diverse student queries and scored the responses on accuracy, relevance, personalization, and bias presence. The results demonstrate that with careful prompt engineering and bias mitigation strategies, AI chatbots can provide high-quality, unbiased academic advising support, achieving mean scores of 9.76 for accuracy, 9.56 for relevance, and 9.60 for personalization, with no stereotypical biases found in the sample data. However, due to the small sample size and limited timeframe, our AI model may not fully reflect the nuances of student queries in engineering academic advising. Regardless, these findings will inform best practices for building ethical AI systems in higher education, offering tools to complement traditional advising and address the inequities faced by many underrepresented and first-generation college students.
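A small sketch of the kind of rubric aggregation the study describes, assuming each response receives 0-10 scores for accuracy, relevance, and personalization plus a bias flag; the field names and example values below are made up for illustration, not the study's instrument or data:

```python
# Aggregate per-response rubric scores into per-metric means and a bias rate.
from statistics import mean
from typing import TypedDict

class ResponseScore(TypedDict):
    accuracy: float         # 0-10 rubric score
    relevance: float        # 0-10 rubric score
    personalization: float  # 0-10 rubric score
    bias_present: bool      # whether any stereotypical bias was detected

def summarize(scores: list[ResponseScore]) -> dict[str, float]:
    return {
        "accuracy": mean(s["accuracy"] for s in scores),
        "relevance": mean(s["relevance"] for s in scores),
        "personalization": mean(s["personalization"] for s in scores),
        "bias_rate": mean(1.0 if s["bias_present"] else 0.0 for s in scores),
    }

# Usage with two hypothetical rated responses.
print(summarize([
    {"accuracy": 9.8, "relevance": 9.5, "personalization": 9.6, "bias_present": False},
    {"accuracy": 9.7, "relevance": 9.6, "personalization": 9.6, "bias_present": False},
]))
```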
AI for Compliance
University of Cambridge
Abstract
As governments move to regulate AI, there is growing interest in using Large Language Models (LLMs) to assess whether or not an AI system complies with a given AI Regulation (AIR). However, there is presently no way to benchmark the performance of LLMs at this task. To fill this void, we introduce AIReg-Bench: the first benchmark dataset designed to test how well LLMs can assess compliance with the EU AI Act (AIA). We created this dataset through a two-step process: (1) by prompting an LLM with carefully structured instructions, we generated 120 technical documentation excerpts (samples), each depicting a fictional, albeit plausible, AI system - of the kind an AI provider might produce to demonstrate their compliance with AIR; (2) legal experts then reviewed and annotated each sample to indicate whether, and in what way, the AI system described therein violates specific Articles of the AIA. The resulting dataset, together with our evaluation of whether frontier LLMs can reproduce the experts' compliance labels, provides a starting point to understand the opportunities and limitations of LLM-based AIR compliance assessment tools and establishes a benchmark against which subsequent LLMs can be compared. The dataset and evaluation code are available at https://github.com/camlsys/aireg-bench.
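A minimal sketch of the label-comparison step, assuming each benchmark sample carries a binary expert violation label and the LLM's judgment is reduced to the same binary; this simplifies the per-Article annotations and is not the released evaluation code at the repository above:

```python
# Compare an LLM's compliance labels against expert annotations and report
# simple agreement. The LLM call is a placeholder.
from dataclasses import dataclass

@dataclass
class Sample:
    excerpt: str        # fictional technical-documentation excerpt
    expert_label: bool  # True if experts judged it to violate the Article

def llm_label(excerpt: str) -> bool:
    """Placeholder: prompt an LLM to judge whether the excerpt describes a
    system that violates the relevant Article of the AI Act."""
    raise NotImplementedError

def agreement(samples: list[Sample]) -> float:
    """Fraction of samples on which the LLM reproduces the expert label."""
    hits = sum(llm_label(s.excerpt) == s.expert_label for s in samples)
    return hits / len(samples)
```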