Hi!

Your personalized paper recommendations for 02 to 06 February 2026.
Spring Health
AI Insights
  • The study evaluates the performance of large language models (LLMs) in detecting and confirming suicidal ideation in user-agents. (ML: 0.98)πŸ‘πŸ‘Ž
  • The results suggest that LLMs can be a useful tool in identifying individuals at risk of suicide, but their performance should be evaluated in conjunction with human judgment. (ML: 0.98)πŸ‘πŸ‘Ž
  • The study suggests that the LLMs can be a useful tool in identifying individuals at risk of suicide, but their performance should be evaluated in conjunction with human judgment. (ML: 0.98)πŸ‘πŸ‘Ž
  • Chance-corrected interrater reliability (IRR): A measure of inter-rater reliability that takes into account chance agreements. (ML: 0.98)πŸ‘πŸ‘Ž
  • The study finds that the LLMs have varying levels of agreement with clinicians on detecting and confirming suicidal ideation. (ML: 0.98)πŸ‘πŸ‘Ž
  • Krippendorff's alpha (Ξ±): A measure of inter-rater reliability, which estimates the proportion of agreement between two or more raters. (ML: 0.97)πŸ‘πŸ‘Ž
  • The study highlights the potential benefits and limitations of using LLMs in detecting and confirming suicidal ideation. (ML: 0.97)πŸ‘πŸ‘Ž
  • The LLMs tend to agree more with each other than with clinicians on detecting and confirming suicidal ideation. (ML: 0.97)πŸ‘πŸ‘Ž
  • User-agent risk level: The level of risk associated with a user's suicidal ideation, categorized as None, Low, High, or Imminent. (ML: 0.95)πŸ‘πŸ‘Ž
  • The LLMs used in this study are GPT-4, GPT-5, Claude Sonnet 4.5, and Gemini 2.5 Flash. (ML: 0.94)πŸ‘πŸ‘Ž
Abstract
Millions now use leading generative AI chatbots for psychological support. Despite the promise related to availability and scale, the single most pressing question in AI for mental health is whether these tools are safe. The Validation of Ethical and Responsible AI in Mental Health (VERA-MH) evaluation was recently proposed to meet the urgent need for an evidence-based automated safety benchmark. This study aimed to examine the clinical validity and reliability of the VERA-MH evaluation for AI safety in suicide risk detection and response. We first simulated a large set of conversations between large language model (LLM)-based users (user-agents) and general-purpose AI chatbots. Licensed mental health clinicians used a rubric (scoring guide) to independently rate the simulated conversations for safe and unsafe chatbot behaviors, as well as user-agent realism. An LLM-based judge used the same scoring rubric to evaluate the same set of simulated conversations. We then compared rating alignment (a) across individual clinicians and (b) between clinician consensus and the LLM judge, and (c) examined clinicians' ratings of user-agent realism. Individual clinicians were generally consistent with one another in their safety ratings (chance-corrected inter-rater reliability [IRR]: 0.77), thus establishing a gold-standard clinical reference. The LLM judge was strongly aligned with this clinical consensus (IRR: 0.81) overall and within key conditions. Clinician raters generally perceived the user-agents to be realistic. For the potential mental health benefits of AI chatbots to be realized, attention to safety is paramount. Findings from this human evaluation study support the clinical validity and reliability of VERA-MH: an open-source, fully automated AI safety evaluation for mental health. Further research will address VERA-MH generalizability and robustness.
Why are we recommending this paper?
Due to your Interest in AI for Compliance

This paper directly addresses the critical need for safety evaluation of AI tools, particularly generative chatbots used in mental health, aligning with your interest in AI governance and responsible AI development. The focus on validation of AI safety is highly relevant to your concerns about AI applications in sensitive domains.
University of Notre Dame
AI Insights
  • The findings highlight the importance of considering human values and well-being in AI development, rather than solely focusing on technical performance or efficiency. (ML: 0.98)πŸ‘πŸ‘Ž
  • The participants' responses indicated that they found the design patterns helpful in making decisions about LLM development, particularly in addressing issues related to fairness and accountability. (ML: 0.98)πŸ‘πŸ‘Ž
  • The study explores the application of virtue ethics in designing large language models (LLMs) to promote their beneficial impact on human society. (ML: 0.98)πŸ‘πŸ‘Ž
  • The study highlights the importance of considering the ethical implications of AI development and the need for designers to prioritize human well-being and values in their work. (ML: 0.98)πŸ‘πŸ‘Ž
  • The researchers conducted semi-structured interviews with 13 participants from diverse backgrounds, including software engineering, data science, and theology. (ML: 0.98)πŸ‘πŸ‘Ž
  • The study demonstrates the potential of applying virtue ethics in designing LLMs to promote their beneficial impact on society. (ML: 0.97)πŸ‘πŸ‘Ž
  • The design patterns presented in the study can serve as a starting point for further research and development of more effective and responsible AI systems. (ML: 0.96)πŸ‘πŸ‘Ž
  • The design patterns presented to the participants were based on existing literature and aimed to address common problems in LLM development, such as bias, energy consumption, and transparency. (ML: 0.95)πŸ‘πŸ‘Ž
  • Large language models (LLMs): Artificial intelligence systems designed to process and generate human-like language, often used for applications such as chatbots, virtual assistants, and text generation. (ML: 0.94)πŸ‘πŸ‘Ž
  • Virtue ethics: An approach to ethics that emphasizes the development of character traits and virtues, rather than following rules or maximizing utility. (ML: 0.92)πŸ‘πŸ‘Ž
Abstract
With the rapid growth of Large Language Models (LLMs), criticism of their societal impact has also grown. Work in Responsible AI (RAI) has focused on the development of AI systems aimed at reducing harm. Responding to RAI's criticisms and the need to bring the wisdom traditions into HCI, we apply Conwill et al.'s Virtue-Guided Technology Design method to LLMs. We cataloged new ethical design patterns for LLMs and evaluated them through interviews with technologists. Participants valued that the patterns provided more accuracy and robustness, better safety, new research opportunities, increased access and control, and reduced waste. Their concerns were that the patterns could be vulnerable to jailbreaking, were generalizing models too widely, and had potential implementation issues. Overall, participants reacted positively while also acknowledging the tradeoffs involved in ethical LLM design.
Why are we recommending this paper?
Due to your Interest in Chat Designers

Given your interest in AI governance and the ethical considerations of LLMs, this paper’s exploration of a virtue-based design methodology offers a valuable approach to shaping AI behavior. The focus on responsible AI (RAI) aligns directly with your stated interests.
Charles University
AI Insights
  • The authors use a dataset from the World Values Survey (WVS) to evaluate the alignment of LLMs with human values and opinions. (ML: 0.99)πŸ‘πŸ‘Ž
  • The authors also find that the cultural values and opinions of LLMs are influenced by their training data and the languages in which they were trained. (ML: 0.99)πŸ‘πŸ‘Ž
  • The study highlights the importance of considering cultural differences when developing and evaluating LLMs. (ML: 0.99)πŸ‘πŸ‘Ž
  • The study finds that LLMs tend to align more closely with Western cultures, particularly American culture, rather than Eastern cultures. (ML: 0.99)πŸ‘πŸ‘Ž
  • The study investigates the cultural values and opinions of large language models (LLMs) across different languages and countries. (ML: 0.98)πŸ‘πŸ‘Ž
  • Correlation Norm: A measure of the similarity between two sets of values. (ML: 0.97)πŸ‘πŸ‘Ž
  • Self-Correlation Distance: A measure of the distance between a set of values and itself. (ML: 0.95)πŸ‘πŸ‘Ž
  • KL Divergence: A measure of the difference between two probability distributions. (ML: 0.93)πŸ‘πŸ‘Ž
  • Chain-of-Thought (CoT) Prompt: A type of prompt that requires the model to provide a step-by-step justification before giving an answer. (ML: 0.93)πŸ‘πŸ‘Ž
  • Direct Prompt: A type of prompt that asks for a direct answer without requiring justification. (ML: 0.83)πŸ‘πŸ‘Ž
Abstract
Recent studies evaluate the value orientation of large language models (LLMs) using adapted social surveys, typically by prompting models with survey questions and comparing their responses to average human responses. This paper identifies limitations in this methodology that, depending on the exact setup, can lead to both underestimating and overestimating the similarity of value orientation. Using the World Values Survey in three languages across five countries, we demonstrate that prompting methods (direct vs. chain-of-thought) and decoding strategies (greedy vs. sampling) significantly affect results. To assess the interaction between answers, we introduce a novel metric, self-correlation distance. This metric measures whether LLMs maintain consistent relationships between answers across different questions, as humans do. This indicates that even a high average agreement with human data, when considering LLM responses independently, does not guarantee structural alignment in responses. Additionally, we reveal a weak correlation between two common evaluation metrics, mean-squared distance and KL divergence, which assume that survey answers are independent of each other. For future research, we recommend CoT prompting, sampling-based decoding with dozens of samples, and robust analysis using multiple metrics, including self-correlation distance.
Why are we recommending this paper?
Due to your Interest in LLMs for Compliance

This research investigates the limitations of current methods for evaluating LLMs, specifically using survey questions, which is a key area of concern when assessing AI trustworthiness and reliability. Understanding these limitations is crucial for developing robust evaluation strategies.
Technische UniversitΓ€t Clausthal
Paper visualization
AI Insights
  • F1 score: A measure of a model's accuracy, combining precision and recall. (ML: 0.98)πŸ‘πŸ‘Ž
  • Guided prompts improve detection accuracy and F1 scores. (ML: 0.97)πŸ‘πŸ‘Ž
  • Large Language Models (LLMs): AI models that can process and generate human-like text. (ML: 0.93)πŸ‘πŸ‘Ž
  • The study highlights the need for further research to improve LLMs' performance in detecting security smells and generating secure IaC scripts. (ML: 0.86)πŸ‘πŸ‘Ž
  • LLMs can detect security smells in IaC scripts, but their performance is not promising. (ML: 0.81)πŸ‘πŸ‘Ž
  • Security smells: Patterns or anomalies in code that indicate potential security vulnerabilities. (ML: 0.79)πŸ‘πŸ‘Ž
  • LLMs generate insecure code when asked to produce secure solutions. (ML: 0.78)πŸ‘πŸ‘Ž
  • Infrastructure as Code (IaC): A method of managing infrastructure using code, rather than manual configuration. (ML: 0.77)πŸ‘πŸ‘Ž
  • Further research is needed to leverage LLMs' potential for secure IaC development. (ML: 0.73)πŸ‘πŸ‘Ž
  • LLMs have potential for secure IaC development, but their current limitations must be addressed. (ML: 0.66)πŸ‘πŸ‘Ž
Abstract
We investigated the capabilities of GPT-4o and Gemini 2.0 Flash for secure Infrastructure as Code (IaC) development. For security smell detection, on the Stack Overflow dataset, which primarily contains small, simplified code snippets, the models detected at least 71% of security smells when prompted to analyze code from a security perspective (general prompt). With a guided prompt (adding clear, step-by-step instructions), this increased to 78%. In GitHub repositories, which contain complete, real-world project scripts, a general prompt was less effective, leaving more than half of the smells undetected. However, with the guided prompt, the models uncovered at least 67% of the smells. For secure code generation, we prompted LLMs with 89 vulnerable synthetic scenarios and observed that only 7% of the generated scripts were secure. Adding an explicit instruction to generate secure code increased GPT's secure-output rate to 17%, while Gemini's changed little (8%). These results highlight the need for further research to improve LLMs' capabilities in assisting developers with secure IaC development.
Why are we recommending this paper?
Due to your Interest in LLMs for Compliance

This paper examines the security implications of using LLMs for Infrastructure as Code (IaC) development, a critical area for compliance and risk management within your specified interests. The investigation into security smells detection is particularly relevant.
Maastricht University
AI Insights
  • The text also discusses the relationship between groundedness and maximization of complete and transitive preference relations. (ML: 0.97)πŸ‘πŸ‘Ž
  • Some of the key concepts explored include consistency, monotonicity, and weak axiom of revealed preference (WARP). (ML: 0.97)πŸ‘πŸ‘Ž
  • The results have implications for understanding rationalizability and groundedness in choice theory. (ML: 0.95)πŸ‘πŸ‘Ž
  • GAIC: Grounded Axiom of Revealed Preference. (ML: 0.92)πŸ‘πŸ‘Ž
  • A choice function c is said to satisfy GMAIC if it maximizes a complete and transitive preference relation over non-empty subsets of X. (ML: 0.91)πŸ‘πŸ‘Ž
  • Groundedness: A choice function c satisfies groundedness if for all x ∈ X, there exists a set S βŠ† X \{x such that I(S) = βˆ…. (ML: 0.89)πŸ‘πŸ‘Ž
  • GMAIC: Grounded Maximizing Axiom of Choice. (ML: 0.89)πŸ‘πŸ‘Ž
  • The provided text provides a comprehensive proof of various theorems and propositions related to choice theory. (ML: 0.89)πŸ‘πŸ‘Ž
  • The proofs cover topics such as injectivity, surjectivity, and double union closure of interpretation functions. (ML: 0.88)πŸ‘πŸ‘Ž
  • A choice function c is said to satisfy GAIC if it satisfies groundedness and the corresponding interpretation I satisfies consistency, monotonicity, and WARP. (ML: 0.88)πŸ‘πŸ‘Ž
  • The proofs demonstrate the relationship between different axioms and properties of choice functions. (ML: 0.86)πŸ‘πŸ‘Ž
  • The provided text appears to be a proof of various theorems and propositions related to choice theory, specifically in the context of rationalizability and groundedness. (ML: 0.86)πŸ‘πŸ‘Ž
Abstract
This paper proposes a model of choice via agentic artificial intelligence (AI). A key feature is that the AI may misinterpret a menu before recommending what to choose. A single acyclicity condition guarantees that there is a monotonic interpretation and a strict preference relation that together rationalize the AI's recommendations. Since this preference is in general not unique, there is no safeguard against it misaligning with that of a decision maker. What enables the verification of such AI alignment is interpretations satisfying double monotonicity. Indeed, double monotonicity ensures full identifiability and internal consistency. But an additional idempotence property is required to guarantee that recommendations are fully rational and remain grounded within the original feasible set.
Why are we recommending this paper?
Due to your Interest in AI Governance

This paper explores a novel model of choice involving AI, focusing on potential misinterpretations and preference relations – a fascinating area for understanding AI’s impact on human decision-making. The concept of an AI misinterpreting a menu is a compelling illustration of potential issues.
ISI
Paper visualization
AI Insights
  • Collaboration scale differs noticeably depending on the subfield, with computational linguistics exhibiting the largest teams and cross-sector collaboration varying by subfield. (ML: 0.98)πŸ‘πŸ‘Ž
  • The study relies on metadata from arXiv, which may not be comprehensive or representative of the broader AI research community. (ML: 0.97)πŸ‘πŸ‘Ž
  • RQ1: Quantifying changes in AI research output RQ2: Quantifying changes in collaboration patterns The growth of AI research is not uniformly distributed across subfields, with machine learning, computer vision, and natural language processing consistently emerging as the dominant AI subfields. (ML: 0.97)πŸ‘πŸ‘Ž
  • Industry-only papers show substantially larger team sizes and greater variability, increasing from an average of approximately 4.6 authors in 2021 to over 8 authors in 2024. (ML: 0.94)πŸ‘πŸ‘Ž
  • Mixed academic–industry papers consistently have the largest team sizes across all years, growing from an average of about 5.7 authors in 2021 to nearly 7 authors by 2025. (ML: 0.83)πŸ‘πŸ‘Ž
  • Academic-only papers consistently exhibit the smallest team sizes across all years, growing gradually from an average of about 3.8 authors in 2021 to roughly 4.7 authors in 2025. (ML: 0.82)πŸ‘πŸ‘Ž
  • The number of authors per paper increased steadily from 2021 through 2025, with an average of approximately 4.4 authors in 2021 and about 5.5 authors by 2025. (ML: 0.79)πŸ‘πŸ‘Ž
  • Papers labeled as unknown affiliation fall between academic and mixed papers in terms of team size, with averages increasing over time from approximately 4.2 authors in 2021 to about 5.4 authors in 2025. (ML: 0.73)πŸ‘πŸ‘Ž
Abstract
The emergence of large language models (LLMs) represents a significant technological shift within the scientific ecosystem, particularly within the field of artificial intelligence (AI). This paper examines structural changes in the AI research landscape using a dataset of arXiv preprints (cs.AI) from 2021 through 2025. Given the rapid pace of AI development, the preprint ecosystem has become a critical barometer for real-time scientific shifts, often preceding formal peer-reviewed publication by months or years. By employing a multi-stage data collection and enrichment pipeline in conjunction with LLM-based institution classification, we analyze the evolution of publication volumes, author team sizes, and academic--industry collaboration patterns. Our results reveal an unprecedented surge in publication output following the introduction of ChatGPT, with academic institutions continuing to provide the largest volume of research. However, we observe that academic--industry collaboration is still suppressed, as measured by a Normalized Collaboration Index (NCI) that remains significantly below the random-mixing baseline across all major subfields. These findings highlight a continuing institutional divide and suggest that the capital-intensive nature of generative AI research may be reshaping the boundaries of scientific collaboration.
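The abstract measures its Normalized Collaboration Index (NCI) against a random-mixing baseline, but the exact definition is not reproduced in this digest. The following is a hypothetical sketch in that spirit: the observed rate of mixed academic-industry papers divided by the rate expected when author affiliations are shuffled across papers with team sizes held fixed; all names and data here are illustrative assumptions.

```python
import numpy as np

def normalized_collaboration_index(papers, n_shuffles=1000, seed=0):
    """Hypothetical NCI sketch: observed mixing vs. a random-mixing baseline.

    `papers` is a list of author-affiliation lists, e.g.
    [["academic", "industry"], ["academic", "academic"], ...].
    """
    observed = np.mean([len(set(p)) > 1 for p in papers])

    # Baseline: shuffle all author affiliations across papers while keeping
    # team sizes fixed, then recompute the mixed-paper rate.
    rng = np.random.default_rng(seed)
    pool = [a for p in papers for a in p]
    baseline_rates = []
    for _ in range(n_shuffles):
        rng.shuffle(pool)
        i, mixed = 0, 0
        for p in papers:
            team = pool[i:i + len(p)]
            i += len(p)
            mixed += len(set(team)) > 1
        baseline_rates.append(mixed / len(papers))
    return observed / np.mean(baseline_rates)

papers = [["academic", "academic"], ["industry", "industry", "industry"],
          ["academic", "industry"], ["academic", "academic", "academic"]]
print(normalized_collaboration_index(papers))  # < 1 means suppressed mixing
```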
Why are we recommending this paper?
Due to your Interest in AI Governance
Northeastern University
Paper visualization
AI Insights
  • A more nuanced approach to AI-generated content is needed, one that takes into account the potential risks and consequences of over-reliance on technology. (ML: 0.98)πŸ‘πŸ‘Ž
  • The study suggests that relying too heavily on generative AI tools can lead to a shift in collaborative practice, where teams reason through problems together, potentially displacing collective deliberation and shared understanding. (ML: 0.98)πŸ‘πŸ‘Ž
  • Designers and developers should prioritize preserving human expertise and contextual understanding in collaborative work. (ML: 0.98)πŸ‘πŸ‘Ž
  • Freelancers emphasized the importance of preserving space for human expertise and contextual understanding when making high-stakes collective decisions. (ML: 0.98)πŸ‘πŸ‘Ž
  • The findings of the study highlight that current generative AI tools embody technological rationality through context failure and over-reliance. (ML: 0.97)πŸ‘πŸ‘Ž
  • A 'context failure' occurs when a generative AI tool fails to understand the nuances of a specific context or situation, leading to outputs that are not relevant or useful. (ML: 0.97)πŸ‘πŸ‘Ž
  • Over-reliance on generative AI tools can lead to a loss of creative agency and skills among freelancers, as they become accustomed to relying on technology for even minor tasks. (ML: 0.97)πŸ‘πŸ‘Ž
  • The study's findings have significant implications for the design and development of collaborative generative AI tools. (ML: 0.96)πŸ‘πŸ‘Ž
  • Freelancers expressed concerns about the potential for AI-generated plagiarism and loss of creative agency, which could damage their reputations and reduce their chances of securing future joint projects. (ML: 0.96)πŸ‘πŸ‘Ž
  • The term 'technological rationality' refers to the idea that technology can solve complex social problems through standardization and efficiency. (ML: 0.95)πŸ‘πŸ‘Ž
Abstract
Most generative AI tools prioritize individual productivity and personalization, with limited support for collaboration. Designed for traditional workplaces, these tools do not fit freelancers' short-term teams or lack of shared institutional support, which can worsen their isolation and overlook freelancing platform dynamics. This mismatch means that, instead of empowering freelancers, current generative AI tools could reinforce existing precarity and make freelancer collaboration harder. To investigate how to design generative AI tools to support freelancer collaboration, we conducted co-design sessions with 27 freelancers. A key concern that emerged was the risk of AI systems compromising their creative agency and work identities when collaborating, especially when AI tools could reproduce content without attribution, threatening the authenticity and distinctiveness of their collaborative work. Freelancers proposed "auxiliary AI" systems, human-guided tools that support their creative agencies and identities, allowing for flexible freelancer-led collaborations that promote "productive friction". Drawing on Marcuse's concept of technological rationality, we argue that freelancers are resisting one-dimensional, efficiency-driven AI, and instead envisioning technologies that preserve their collective creative agencies. We conclude with design recommendations for collaborative generative AI tools for freelancers.
Why are we recommending this paper?
Due to your Interest in Chat Designers