University of New Hampshire
AI Insights - The authors propose a framework for evaluating LLMs as judges, which could potentially improve the accuracy and fairness of AI-generated decisions. (ML: 0.98)
- The authors propose a framework for evaluating large language models (LLMs) as judges, which they call LLM-as-a-Judge. (ML: 0.98)
- They also discuss the challenges and limitations of using LLMs in legal settings, including the potential for hallucinations and biases. (ML: 0.97)
- LLM-as-a-Judge: A framework for evaluating large language models as judges, which involves using them to make decisions and then evaluating their performance. (ML: 0.97)
- It also emphasizes the importance of addressing the challenges and limitations of using LLMs in legal settings. (ML: 0.96)
- The paper highlights the need for more research on the evaluation of AI-generated summaries in legal contexts. (ML: 0.96)
- The paper discusses the importance of evaluating AI-generated summaries in legal contexts. (ML: 0.95)
- Bias: An unfair or prejudiced attitude towards certain groups or individuals. (ML: 0.94)
- The paper does not provide a clear explanation of how to implement the LLM-as-a-Judge framework. (ML: 0.94)
- Hallucination: A term used to describe when a model generates information that is not present in the input data. (ML: 0.90)
Abstract
While large language models (LLMs) are increasingly used to summarize long documents, this trend poses significant challenges in the legal domain, where the factual accuracy of deposition summaries is crucial. Nugget-based methods have been shown to be extremely helpful for the automated evaluation of summarization approaches. In this work, we translate these methods to the user side and explore how nuggets could directly assist end users. Although prior systems have demonstrated the promise of nugget-based evaluation, its potential to support end users remains underexplored. Focusing on the legal domain, we present a prototype that leverages a factual nugget-based approach to support legal professionals in two concrete scenarios: (1) determining which of two summaries is better, and (2) manually improving an automatically generated summary.
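The two user-facing scenarios from the abstract can be illustrated with a minimal sketch. This is not the paper's actual system, which presumably uses far more robust nugget matching than substring search; the function names and example nuggets below are hypothetical.

```python
# Illustrative sketch only: nugget coverage via naive substring matching.
# The paper's prototype likely uses semantic matching instead.
def nugget_recall(summary: str, nuggets: list[str]) -> float:
    """Fraction of factual nuggets whose text appears in the summary."""
    text = summary.lower()
    hits = sum(1 for n in nuggets if n.lower() in text)
    return hits / len(nuggets) if nuggets else 0.0

def better_summary(summary_a: str, summary_b: str, nuggets: list[str]) -> str:
    """Scenario (1): pick the summary that covers more nuggets."""
    ra = nugget_recall(summary_a, nuggets)
    rb = nugget_recall(summary_b, nuggets)
    return "A" if ra >= rb else "B"

def missing_nuggets(summary: str, nuggets: list[str]) -> list[str]:
    """Scenario (2): list uncovered nuggets to guide manual improvement."""
    text = summary.lower()
    return [n for n in nuggets if n.lower() not in text]
```

Surfacing the `missing_nuggets` list to a legal professional is the core of scenario (2): the editor sees exactly which facts the draft summary failed to cover.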
Why are we recommending this paper?
Due to your Interest in AI for Compliance
This paper directly addresses the critical need for evaluating AI-generated summaries, a key concern given the user's interest in LLMs for compliance and accuracy in high-stakes domains like legal proceedings. The focus on automated evaluation methods aligns with the user's interest in robust AI governance.
University of Toronto
AI Insights - The study explores the use of Large Language Models (LLMs) in tracking case progress and deviations in child welfare cases. (ML: 0.99)
- The study demonstrated that child welfare casenotes follow a structured sequence of events, and that segmenting cases into equal intervals can reveal temporal trends across cases of varying durations. (ML: 0.98)
- The study found that LLMs can accurately track case progress and deviations in less complex, shorter-duration cases. (ML: 0.97)
- Cohen's kappa: a statistical measure that calculates the degree of agreement between two raters or models. (ML: 0.97)
- However, their performance decreased as cases became longer in duration. (ML: 0.97)
- The study investigates the use of LLMs in tracking case progress and deviations in child welfare cases, highlighting their potential benefits and limitations. (ML: 0.96)
- The model struggled to infer how acronyms or specific service provider names were related to an activity. (ML: 0.95)
- The LocalLLM's performance decreased as cases became longer in duration. (ML: 0.94)
- The LocalLLM tended to label regular casenote entries as Activity-relevant due to its prompt configuration limitations and dataset characteristics. (ML: 0.94)
- False Negative Rate (FNR): the proportion of actual positives that the model incorrectly predicts as negative. (ML: 0.92)
- False Positive Rate (FPR): the proportion of actual negatives that the model incorrectly predicts as positive. (ML: 0.91)
- Previous work by Saxena et al. (ML: 0.66)
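The agreement and error metrics defined in the insights above can be computed directly from a binary confusion matrix. The sketch below is illustrative, not from the paper; labels encode Activity-relevant entries as 1 and regular entries as 0.

```python
# Illustrative metric computation for binary labels (1 = Activity-relevant).
def confusion(y_true, y_pred):
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn

def fnr(y_true, y_pred):
    tp, _, _, fn = confusion(y_true, y_pred)
    return fn / (fn + tp)  # misses among actual positives

def fpr(y_true, y_pred):
    _, tn, fp, _ = confusion(y_true, y_pred)
    return fp / (fp + tn)  # false alarms among actual negatives

def cohens_kappa(y_true, y_pred):
    tp, tn, fp, fn = confusion(y_true, y_pred)
    n = tp + tn + fp + fn
    po = (tp + tn) / n  # observed agreement
    # chance agreement from the marginal label frequencies of both raters
    pe = ((tp + fp) * (tp + fn) + (tn + fn) * (tn + fp)) / n**2
    return (po - pe) / (1 - pe)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when one label (here, regular casenote entries) dominates the dataset.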
Abstract
Governments are the primary providers of essential public services and are responsible for delivering them effectively. In high-stakes decision-making domains such as child welfare (CW), agencies must protect children without unnecessarily prolonging a family's engagement with the system. With growing optimism around AI, governments are pushing for its integration but concerns regarding feasibility and harms remain. Through collaborations with a large Canadian CW agency, we examined how LocalLLM and BERTopic models can track CW case progress. We demonstrate how the tools can potentially assist workers in opportunistically addressing gaps in their work by signaling case progress/deviations. And yet, we also show how they fail to detect case trajectories that require discretionary judgments grounded in social work training, areas where practitioners would actually want support to pre-emptively address substantive case concerns. We also provide a roadmap of future participatory directions to co-design language tools for/with the public sector.
Why are we recommending this paper?
Due to your Interest in LLMs for Compliance
This research explores the application of LLMs in public services, directly relating to the user's interest in AI governance and its potential impact on critical systems. The paper's focus on effective delivery aligns with the user's interest in responsible AI implementation.
University of the Cumberlands
AI Insights - However, the development and deployment of Agentic AI require careful consideration of the challenges associated with these technologies, including safety, effectiveness, and alignment with human values. (ML: 0.97)
- The development of Agentic AI requires a multidisciplinary approach involving clinicians, data scientists, and ethicists to ensure that these systems are safe, effective, and aligned with human values. (ML: 0.97)
- Challenges include a lack of standardization in Agentic AI development and deployment, and insufficient consideration of human values and ethics in Agentic AI design. (ML: 0.95)
- Human-AI Ecosystems: The complex systems that arise when humans interact with Agentic AI, requiring a new set of skills and competencies to manage effectively. (ML: 0.93)
- Agentic AI in healthcare has the potential to transform patient care by enabling personalized medicine and improving treatment outcomes. (ML: 0.91)
- Regulatory frameworks such as the EU AI Act provide a starting point for establishing guidelines for Agentic AI in healthcare, but more work is needed to address the complexities of these systems. (ML: 0.90)
- Agentic AI has the potential to revolutionize healthcare by enabling personalized medicine and improving treatment outcomes. (ML: 0.89)
- The EU AI Act and other regulatory frameworks aim to establish guidelines for the development and deployment of Agentic AI in healthcare, but more work is needed to address the challenges associated with these technologies. (ML: 0.89)
- Agentic AI: A type of artificial intelligence that can perform tasks independently and make decisions based on its own goals and motivations. (ML: 0.83)
Abstract
Healthcare organizations are beginning to embed agentic AI into routine workflows, including clinical documentation support and early-warning monitoring. As these capabilities diffuse across departments and vendors, health systems face agent sprawl, causing duplicated agents, unclear accountability, inconsistent controls, and tool permissions that persist beyond the original use case. Existing AI governance frameworks emphasize lifecycle risk management but provide limited guidance for the day-to-day operations of agent fleets. We propose a Unified Agent Lifecycle Management (UALM) blueprint derived from a rapid, practice-oriented synthesis of governance standards, agent security literature, and healthcare compliance requirements. UALM maps recurring gaps onto five control-plane layers: (1) an identity and persona registry, (2) orchestration and cross-domain mediation, (3) PHI-bounded context and memory, (4) runtime policy enforcement with kill-switch triggers, and (5) lifecycle management and decommissioning linked to credential revocation and audit logging. A companion maturity model supports staged adoption. UALM offers healthcare CIOs, CISOs, and clinical leaders an implementable pattern for audit-ready oversight that preserves local innovation and enables safer scaling across clinical and administrative domains.
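The abstract describes the UALM control-plane layers only at the architecture level. As a loose illustration of how layers (1) and (5) interlock, an identity registry could tie decommissioning to credential revocation and audit logging. Every class, field, and permission name below is hypothetical, not taken from the paper.

```python
# Hypothetical sketch of UALM layers (1) identity/persona registry and
# (5) lifecycle decommissioning linked to revocation and audit logging.
from dataclasses import dataclass
from datetime import datetime, timezone

@dataclass
class AgentRecord:
    agent_id: str
    persona: str                    # e.g. "clinical-documentation-assistant"
    tool_permissions: set
    active: bool = True

class AgentRegistry:
    def __init__(self):
        self._agents = {}
        self.audit_log = []

    def register(self, record: AgentRecord) -> None:
        self._agents[record.agent_id] = record
        self._log(f"registered {record.agent_id} as {record.persona}")

    def decommission(self, agent_id: str) -> None:
        """Layer 5: deactivate, revoke all tool credentials, audit-log."""
        rec = self._agents[agent_id]
        rec.active = False
        rec.tool_permissions.clear()  # credential revocation
        self._log(f"decommissioned {agent_id}")

    def _log(self, event: str) -> None:
        self.audit_log.append(f"{datetime.now(timezone.utc).isoformat()} {event}")
```

The point of coupling these steps in one operation is exactly the "agent sprawl" failure mode the abstract names: permissions that persist beyond the original use case.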
Why are we recommending this paper?
Due to your Interest in AI Governance
This paper investigates agentic AI governance, a crucial area for managing complex AI systems, particularly relevant to the user's interest in AI governance and its application within healthcare. The focus on accountability and lifecycle management is a strong match.
Delft University of Technology
AI Insights - Ecological validity: The extent to which the results of an experiment can be generalized to real-world situations. (ML: 0.99)
- Fair compensation: Ensuring that participants receive a fair wage for their work, considering factors such as task complexity and required expertise. (ML: 0.98)
- The Incentive-Tuning Framework provides a standardized solution for designing effective incentive schemes in human-AI decision-making studies. (ML: 0.97)
- The Incentive-Tuning Framework is a standardized solution for designing and documenting effective incentive schemes in human-AI decision-making studies. (ML: 0.97)
- Incentive scheme: A system of rewards or penalties designed to motivate participants in human-AI decision-making studies. (ML: 0.97)
- The Incentive-Tuning Framework aims to address methodological challenges surrounding incentive design and provide a solution for researchers to tune 'appropriate' incentive schemes for their specific studies. (ML: 0.97)
- A well-designed framework can foster a standardized, systematic, and comprehensive approach to designing effective incentive schemes. (ML: 0.96)
- Researchers should prioritize intentional design and alignment with research goals when employing an incentive scheme. (ML: 0.96)
- Researchers should explicitly identify the purpose of employing an incentive scheme to ensure intentional design and alignment with research goals. (ML: 0.95)
- The framework consists of five steps: identifying the purpose of employing an incentive scheme, coming up with a base pay, designing a bonus structure, gathering participant feedback, and reflecting on design implications. (ML: 0.88)
Abstract
AI has revolutionised decision-making across various fields. Yet human judgement remains paramount for high-stakes decision-making. This has fueled explorations of collaborative decision-making between humans and AI systems, aiming to leverage the strengths of both. To explore this dynamic, researchers conduct empirical studies, investigating how humans use AI assistance for decision-making and how this collaboration impacts results. A critical aspect of conducting these studies is the role of participants, often recruited through crowdsourcing platforms. The validity of these studies hinges on the behaviours of the participants, hence effective incentives that can potentially affect these behaviours are a key part of designing and executing these studies. In this work, we aim to address the critical role of incentive design for conducting empirical human-AI decision-making studies, focusing on understanding, designing, and documenting incentive schemes. Through a thematic review of existing research, we explored the current practices, challenges, and opportunities associated with incentive design for human-AI decision-making empirical studies. We identified recurring patterns, or themes, such as what comprises the components of an incentive scheme, how incentive schemes are manipulated by researchers, and the impact they can have on research outcomes. Leveraging the acquired understanding, we curated a set of guidelines to aid researchers in designing effective incentive schemes for their studies, called the Incentive-Tuning Framework, outlining how researchers can undertake, reflect on, and document the incentive design process. By advocating for a standardised yet flexible approach to incentive design and contributing valuable insights along with practical tools, we hope to pave the way for more reliable and generalizable knowledge in the field of human-AI decision-making.
Why are we recommending this paper?
Due to your Interest in AI Governance
This research delves into the design of incentives for human-AI collaboration, which is essential for effective AI governance and leveraging human expertise. The paper's exploration of collaborative decision-making aligns with the user's interest in combining human and AI capabilities.
Vanderbilt University
AI Insights - The researchers found that LLMs can be effective tools for providing formative feedback, but they also highlight the importance of human oversight and evaluation to ensure accuracy and fairness. (ML: 0.99)
- However, the researchers also note that there are limitations to using LLMs in education, including issues related to bias, accuracy, and transparency. (ML: 0.98)
- The study evaluates the effectiveness of using large language models (LLMs) for educational purposes, specifically in providing feedback and assessing student performance. (ML: 0.98)
- Bias: The tendency for systems or models to favor certain groups or outcomes over others. (ML: 0.98)
- Formative Feedback: Feedback provided during the learning process to help students improve their understanding and skills. (ML: 0.97)
- Automated Grading: The use of technology, such as LLMs, to grade student assignments and assessments. (ML: 0.97)
- Accuracy: The degree to which a system or model produces correct results. (ML: 0.97)
- The study concludes that while LLMs have the potential to revolutionize education, their use must be carefully managed and monitored to ensure they are used responsibly and effectively. (ML: 0.97)
- The study suggests that LLMs can help reduce the workload of teachers by automating tasks such as grading and providing feedback, allowing them to focus on more critical aspects of teaching. (ML: 0.97)
- Large Language Models (LLMs): AI models that can process and generate human-like language. (ML: 0.93)
Abstract
As large language models (LLMs) become increasingly common in educational applications, there is a growing need for evidence-based methods to design and evaluate LLM prompts that produce personalized and pedagogically aligned outputs. This study presents a generalizable, systematic approach for evaluating prompts, demonstrated through an analysis of LLM-generated follow-up questions in a structured dialogue activity. Six prompt templates were designed and tested. The templates incorporated established prompt engineering patterns, with each prompt emphasizing distinct pedagogical strategies. The prompt templates were compared through a tournament-style evaluation framework that can be adapted for other educational applications. The tournament employed the Glicko2 rating system with eight judges evaluating question pairs across three dimensions: format, dialogue support, and appropriateness for learners. Data was sourced from 120 authentic user interactions across three distinct educational deployments. Results showed that a single prompt related to strategic reading outperformed other templates with win probabilities ranging from 81% to 100% in pairwise comparisons. This prompt combined persona and context manager patterns and was designed to support metacognitive learning strategies such as self-directed learning. The methodology showcases how educational technology researchers can systematically evaluate and improve prompt designs, moving beyond ad-hoc prompt engineering toward evidence-based prompt development for educational applications.
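The tournament mechanics can be illustrated with a rating sketch. The paper uses the Glicko-2 system, which also tracks rating deviation and volatility; plain Elo is substituted below as a deliberately simpler stand-in, and the prompt names are made up.

```python
# Simplified stand-in for the tournament: pairwise judge verdicts update
# per-prompt ratings (the paper itself uses Glicko-2, not Elo).
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1 / (1 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a, r_b, winner, k=32):
    """winner is 'A' or 'B'; returns updated (r_a, r_b)."""
    s_a = 1.0 if winner == "A" else 0.0
    e_a = expected_score(r_a, r_b)
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

def run_tournament(prompts, verdicts, start=1500.0):
    """verdicts: iterable of (prompt_a, prompt_b, winning_prompt)."""
    ratings = {p: start for p in prompts}
    for a, b, w in verdicts:
        ra, rb = elo_update(ratings[a], ratings[b], "A" if w == a else "B")
        ratings[a], ratings[b] = ra, rb
    return ratings
```

A rating system is preferable to raw win counts here because prompts meet different opponents in different orders; both Elo and Glicko-2 weight each win by how strong the defeated opponent was.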
Why are we recommending this paper?
Due to your Interest in LLMs for Compliance
Given the user's interest in LLMs for compliance and AI governance, this paper's focus on evaluating LLM prompts for educational applications is highly relevant. The systematic approach for prompt design directly addresses the need for responsible AI development and deployment.
ShanghaiTech University
AI Insights - The system's use of AI-powered tools may raise concerns about bias and accuracy. (ML: 0.99)
- The system uses SHapley Additive exPlanations (SHAP) to illustrate each dimension's contribution to the predicted preference score. (ML: 0.97)
- The system uses a combination of natural language processing (NLP) and computer vision techniques to analyze user input and generate design options. (ML: 0.97)
- The system's reliance on large datasets may make it difficult to implement in scenarios where data is limited or unavailable. (ML: 0.96)
- The system's ability to integrate user feedback into the generation process allows designers to refine their designs based on user preferences. (ML: 0.95)
- ComfyUI: A generative model workflow tool for local development. (ML: 0.90)
- VisionRealistic v2 FluxDev: A computer vision model used to produce high-quality scene graphs tailored to try-on scenarios. (ML: 0.88)
- GPT-4: A large language model developed by OpenAI that can perform a wide range of tasks, including text generation and translation. (ML: 0.88)
- DesignBridge is a system that uses AI-powered tools to analyze user input and generate design options. (ML: 0.81)
- The system's technical implementation details are provided in the following section. (ML: 0.77)
- DesignBridge employs Vue.js for front-end development and utilizes Python for the back-end implementation. (ML: 0.67)
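SHAP, mentioned in the insights above, attributes a model's predicted preference score to individual input dimensions using Shapley values. As an illustration of what is being attributed, not of how DesignBridge or the shap library actually computes it, exact Shapley values for a toy preference model can be brute-forced over all feature orderings. The model, dimensions, and values below are invented.

```python
# Brute-force exact Shapley values for a tiny model: average each feature's
# marginal contribution to the prediction over every join order.
from itertools import permutations

def shapley_values(features: dict, predict, baseline: dict) -> dict:
    """features: present values; baseline: values for 'absent' features."""
    names = list(features)
    phi = {n: 0.0 for n in names}
    orders = list(permutations(names))
    for order in orders:
        current = dict(baseline)
        prev = predict(current)
        for n in order:
            current[n] = features[n]     # feature joins the coalition
            new = predict(current)
            phi[n] += new - prev         # its marginal contribution here
            prev = new
    return {n: v / len(orders) for n, v in phi.items()}
```

Brute force is factorial in the number of dimensions, which is why SHAP exists: it exploits model structure or sampling to approximate the same attributions efficiently.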
Abstract
Effective collaboration between designers and users is important for fashion design, which can increase the user acceptance of fashion products and thereby create value. However, it remains an enduring challenge, as traditional designer-centric approaches restrict meaningful user participation, while user-driven methods demand design proficiency, often marginalizing professional creative judgment. Current co-design practices, including workshops and AI-assisted frameworks, struggle with low user engagement, inefficient preference collection, and difficulties in balancing user feedback with design considerations. To address these challenges, we conducted a formative study with designers and users experienced in co-design (N=7), identifying critical challenges for current collaboration between designers and users in the co-design process, and their requirements. Informed by these insights, we introduce DesignBridge, a multi-platform AI-enhanced interactive system that bridges designer expertise and user preferences through three stages: (1) Initial Design Framing, where designers define initial concepts. (2) Preference Expression Collection, where users intuitively articulate preferences via interactive tools. (3) Preference-Integrated Design, where designers use AI-assisted analytics to integrate feedback into cohesive designs. A user study demonstrates that DesignBridge significantly enhances user preference collection and analysis, enabling designers to integrate diverse preferences with professional expertise.
Why are we recommending this paper?
Due to your Interest in Chat Designers
RPTU University of Kaiserslautern-Landau
AI Insights - A case study with 18 high-school students demonstrated that participants without prior experience were able to implement functional designs. (ML: 0.96)
- The paper explores the use of Large Language Models (LLMs) to make chip design more accessible to beginners. (ML: 0.96)
- An LLM-based chat agent is integrated into a browser-based learning workflow built upon the Tiny Tapeout ecosystem. (ML: 0.91)
- Glossary: LLM (Large Language Model); RTL (Register-Transfer Level, a level of abstraction for digital circuits); Tiny Tapeout (a shared silicon tapeout platform accessible to everyone); VGA (Video Graphics Array). The LLM-based chat agent and workflow are effective in enabling non-experts to design functional chips. (ML: 0.86)
- All eight student groups successfully developed VGA chips in a 130 nm technology during a 90-minute session. (ML: 0.79)
- The approach enables users to progress from an initial design idea through RTL code generation to a tapeout-ready chip within a short time frame. (ML: 0.73)
Abstract
This paper presents an LLM-based learning platform for chip design education, aiming to make chip design accessible to beginners without overwhelming them with technical complexity. It represents the first educational platform that assists learners holistically across both frontend and backend design. The proposed approach integrates an LLM-based chat agent into a browser-based workflow built upon the Tiny Tapeout ecosystem. The workflow guides users from an initial design idea through RTL code generation to a tapeout-ready chip. To evaluate the concept, a case study was conducted with 18 high-school students. Within a 90-minute session they developed eight functional VGA chip designs in a 130 nm technology. Despite having no prior experience in chip design, all groups successfully implemented tapeout-ready projects. The results demonstrate the feasibility and educational impact of LLM-assisted chip design, highlighting its potential to attract and inspire early learners and significantly broaden the target audience for the field.
Why are we recommending this paper?
Due to your Interest in Chat Designers