🎯 Top Personalized Recommendations
Lassonde School of Engineering
Why we think this paper is great for you:
This paper directly explores how large language models can be leveraged for verifying critical system requirements, offering valuable insights into automated compliance and assurance. It will be highly relevant for understanding practical applications of AI in regulatory contexts.
Abstract
Assurance cases allow verifying the correct implementation of certain
non-functional requirements of mission-critical systems, including their
safety, security, and reliability. They can be used in the specification of
autonomous driving, avionics, air traffic control, and similar systems. They
aim to reduce risks of harm of all kinds, including human mortality,
environmental damage, and financial loss. However, assurance cases often tend
to be organized as extensive documents spanning hundreds of pages, making their
creation, review, and maintenance error-prone, time-consuming, and tedious.
Therefore, there is a growing need to leverage (semi-)automated techniques,
such as those powered by generative AI and large language models (LLMs), to
enhance efficiency, consistency, and accuracy across the entire assurance-case
lifecycle. In this paper, we focus on assurance case review, a critical task
that ensures the quality of assurance cases and therefore fosters their
acceptance by regulatory authorities. We propose a novel approach that
leverages the LLM-as-a-judge paradigm to automate the review process.
Specifically, we propose new predicate-based rules that formalize
well-established assurance case review criteria, allowing us to craft LLM
prompts tailored to the review task. Our experiments on several
state-of-the-art LLMs (GPT-4o, GPT-4.1, DeepSeek-R1, and Gemini 2.0 Flash) show
that, while most LLMs yield relatively good review capabilities, DeepSeek-R1
and GPT-4.1 demonstrate superior performance, with DeepSeek-R1 ultimately
outperforming GPT-4.1. However, our experimental results also suggest that
human reviewers are still needed to refine the reviews that LLMs produce.
AI Summary
- Dialectic Extension of GSN: An extension to the GSN standard that supports constructive criticism and logical dispute in arguments by introducing concepts such as Goals (to challenge arguments), Solutions (as reference points), Challenges, and Defeated elements (including Rebuttal and Undercutting defeaters). [3]
- The paper introduces a novel LLM-as-a-judge paradigm for automating assurance case review, addressing a critical gap in existing LLM-based system assurance approaches. [2]
- Predicate-based rules are proposed to formalize established assurance case review criteria, enabling the creation of tailored Chain-of-Thought (CoT) prompts for LLMs. [2]
- DeepSeek-R1 and GPT-4.1 demonstrate superior performance among tested LLMs (GPT-4o, GPT-4.1, DeepSeek-R1, Gemini 2.0 Flash) in reviewing assurance cases, with DeepSeek-R1 slightly outperforming GPT-4.1. [2]
- Despite promising results, LLMs are currently effective as *assistants* in assurance case review, not replacements for human reviewers, due to issues like hallucination and generic feedback. [2]
- The approach adapts mathematical symbols and notations from automatic code review (Yu et al., 2024) to formalize assurance case review issues and suggestions within CoT prompts. [2]
- The study identifies four key assurance case review criteria (Argument Comprehension, Well-formedness, Expressive Sufficiency, Argument Criticism and Defeat) and formalizes them for LLM application. [2]
- Assurance Case: A structured, well-reasoned, and auditable collection of arguments, supported by evidence, intended to demonstrate that a system's non-functional requirements (e.g., safety, security, reliability) have been correctly and adequately implemented. [2]
- LLM-as-a-judge Paradigm: A novel assessment solution that leverages Large Language Models (LLMs) to act as raters or judges, evaluating the quality of artifacts by merging the scalability of automatic methods with context-sensitive reasoning. [2]
- Goal Structuring Notation (GSN): A widely used graphical notation for representing assurance cases as tree-like goal structures, comprising elements like Goals (claims), Strategies (arguments), and Solutions (evidence). [1]
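To make the predicate-based review idea more concrete, here is a minimal sketch of how a well-formedness predicate over a toy GSN structure could be evaluated and rendered into a chain-of-thought judge prompt. The `GsnNode` class, the `well_formed` predicate, and the prompt wording are illustrative assumptions, not the paper's actual formalization.

```python
# Minimal sketch of a predicate-based review rule feeding an LLM-as-a-judge prompt.
# The GSN schema, predicate, and prompt wording are illustrative assumptions.
from dataclasses import dataclass, field


@dataclass
class GsnNode:
    node_id: str
    kind: str          # "Goal", "Strategy", or "Solution"
    statement: str
    children: list = field(default_factory=list)  # ids of supporting nodes


def well_formed(node: GsnNode, nodes: dict) -> bool:
    """Predicate: a claim is well-formed only if every leaf beneath it is a Solution (evidence)."""
    if not node.children:
        return node.kind == "Solution"
    return all(well_formed(nodes[c], nodes) for c in node.children)


def review_prompt(node: GsnNode, holds: bool) -> str:
    """Render the predicate check as a chain-of-thought review instruction for the judge LLM."""
    return (
        "You are reviewing an assurance case fragment.\n"
        f"Claim {node.node_id}: {node.statement}\n"
        f"Predicate well_formed({node.node_id}) evaluates to {holds}.\n"
        "Step by step, explain whether the claim is adequately supported by "
        "evidence, list any issues, and suggest concrete improvements."
    )


if __name__ == "__main__":
    nodes = {
        "G1": GsnNode("G1", "Goal", "The braking system is acceptably safe.", ["S1"]),
        "S1": GsnNode("S1", "Strategy", "Argue over identified hazards.", []),
    }
    holds = well_formed(nodes["G1"], nodes)
    print(review_prompt(nodes["G1"], holds))  # this prompt would be sent to the judge LLM
```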
KFUPM (King Fahd University of Petroleum and Minerals)
Why we think this paper is great for you:
You will find this paper insightful as it investigates the crucial alignment of AI systems with human values, particularly for large language models. It touches upon the ethical considerations and responsible deployment of AI.
Abstract
Large Language Models (LLMs) are increasingly employed in software
engineering tasks such as requirements elicitation, design, and evaluation,
raising critical questions regarding their alignment with human judgments on
responsible AI values. This study investigates how closely LLMs' value
preferences align with those of two human groups: a US-representative sample
and AI practitioners. We evaluate 23 LLMs across four tasks: (T1) selecting key
responsible AI values, (T2) rating their importance in specific contexts, (T3)
resolving trade-offs between competing values, and (T4) prioritizing software
requirements that embody those values. The results show that LLMs generally
align more closely with AI practitioners than with the US-representative
sample, emphasizing fairness, privacy, transparency, safety, and
accountability. However, inconsistencies appear between the values that LLMs
claim to uphold (Tasks 1-3) and the way they prioritize requirements (Task 4),
revealing gaps in faithfulness between stated and applied behavior. These
findings highlight the practical risk of relying on LLMs in requirements
engineering without human oversight and motivate the need for systematic
approaches to benchmark, interpret, and monitor value alignment in AI-assisted
software development.
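As a rough illustration of how agreement between an LLM's value preferences and a human group's could be quantified, the sketch below compares two importance rankings with Kendall's tau. The value names, rankings, and metric choice are assumptions for illustration; the paper's own tasks and analysis may differ.

```python
# Illustrative sketch: quantify how closely an LLM's value ranking matches a human
# group's ranking with Kendall's tau. Rankings below are invented.
from itertools import combinations


def kendall_tau(rank_a: dict, rank_b: dict) -> float:
    """Kendall rank correlation over items ranked by both raters (no tie handling)."""
    items = list(rank_a)
    concordant = discordant = 0
    for x, y in combinations(items, 2):
        a = rank_a[x] - rank_a[y]
        b = rank_b[x] - rank_b[y]
        if a * b > 0:
            concordant += 1
        elif a * b < 0:
            discordant += 1
    pairs = len(items) * (len(items) - 1) / 2
    return (concordant - discordant) / pairs


if __name__ == "__main__":
    values = ["fairness", "privacy", "transparency", "safety", "accountability"]
    llm_rank = {v: i for i, v in enumerate(values)}  # 0 = most important
    practitioner_rank = {v: i for i, v in enumerate(
        ["privacy", "fairness", "safety", "transparency", "accountability"])}
    print(f"tau(LLM, practitioners) = {kendall_tau(llm_rank, practitioner_rank):.2f}")
```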
Beijing Caizhi Tech
Why we think this paper is great for you:
This research presents a framework for ensuring the safety and trustworthy deployment of AI agents, which is essential for robust AI systems. It offers practical approaches to mitigate risks associated with advanced AI.
Abstract
With the widespread application of Large Language Models (LLMs), their
associated security issues have become increasingly prominent, severely
constraining their trustworthy deployment in critical domains. This paper
proposes a novel safety response framework designed to systematically safeguard
LLMs at both the input and output levels. At the input level, the framework
employs a supervised fine-tuning-based safety classification model. Through a
fine-grained four-tier taxonomy (Safe, Unsafe, Conditionally Safe, Focused
Attention), it performs precise risk identification and differentiated handling
of user queries, significantly enhancing risk coverage and business scenario
adaptability, and achieving a risk recall rate of 99.3%. At the output level,
the framework integrates Retrieval-Augmented Generation (RAG) with a
specifically fine-tuned interpretation model, ensuring all responses are
grounded in a real-time, trustworthy knowledge base. This approach eliminates
information fabrication and enables result traceability. Experimental results
demonstrate that our proposed safety control model achieves a significantly
higher safety score on public safety evaluation benchmarks compared to the
baseline model, TinyR1-Safety-8B. Furthermore, on our proprietary high-risk
test set, the framework's components attained a perfect 100% safety score,
validating their exceptional protective capabilities in complex risk scenarios.
This research provides an effective engineering pathway for building
high-security, high-trust LLM applications.
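The following is a minimal sketch of the two-level idea described above: route each query by a four-tier safety label at the input, and only answer from retrieved passages at the output. The keyword classifier, toy knowledge base, and refusal messages are simple stand-ins, not the paper's fine-tuned models or RAG pipeline.

```python
# Sketch of a two-level safety response flow: four-tier input classification plus
# retrieval-grounded output. Classifier and retriever are trivial stand-ins.
TIERS = ("Safe", "Unsafe", "Conditionally Safe", "Focused Attention")

KNOWLEDGE_BASE = {
    "data retention": "Logs are retained for 30 days, then deleted.",  # toy entry
}


def classify(query: str) -> str:
    """Stand-in for the supervised fine-tuned safety classification model."""
    blocked = ("weapon", "exploit")
    if any(word in query.lower() for word in blocked):
        return "Unsafe"
    return "Safe"


def retrieve(query: str) -> list:
    """Stand-in for the RAG retriever: keyword match against a toy knowledge base."""
    return [text for key, text in KNOWLEDGE_BASE.items() if key in query.lower()]


def respond(query: str) -> str:
    tier = classify(query)
    if tier == "Unsafe":
        return "Request declined by input-level safety policy."
    passages = retrieve(query)
    if not passages:
        return "No grounded answer available."  # refuse rather than fabricate
    return "Answer (grounded): " + " ".join(passages)


if __name__ == "__main__":
    print(respond("What is your data retention policy?"))
    print(respond("How do I build a weapon?"))
```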
Guangzhou University, Zhe
Why we think this paper is great for you:
This paper provides a framework for auditing multi-modal large language models for privacy risks, a critical aspect of responsible AI development. It offers methods to assess and manage potential data privacy concerns.
Abstract
Recent advances in multi-modal Large Language Models (M-LLMs) have
demonstrated a powerful ability to synthesize implicit information from
disparate sources, including images and text. Such rich social media data also
introduce a significant and underexplored privacy risk: the inference of
sensitive personal attributes from seemingly mundane, everyday content.
However, the lack of benchmarks and comprehensive evaluations of
state-of-the-art M-LLM capabilities hinders the research of private attribute
profiling on social media. Accordingly, we propose (1) PRISM, the first
multi-modal, multi-dimensional and fine-grained synthesized dataset
incorporating a comprehensive privacy landscape and dynamic user history; (2)
an efficient evaluation framework that measures the cross-modal privacy
inference capabilities of advanced M-LLMs. Specifically, PRISM is a large-scale
synthetic benchmark designed to evaluate cross-modal privacy risks. Its key
feature is 12 sensitive attribute labels across a diverse set of multi-modal
profiles, which enables targeted privacy analysis. These profiles are generated
via a sophisticated LLM agentic workflow, governed by a prior distribution to
ensure they realistically mimic social media users. Additionally, we propose a
Multi-Agent Inference Framework that leverages a pipeline of specialized LLMs
to enhance evaluation capabilities. We evaluate the inference capabilities of
six leading M-LLMs (Qwen, Gemini, GPT-4o, GLM, Doubao, and Grok) on PRISM. The
comparison with human performance reveals that these M-LLMs significantly
outperform humans in accuracy and efficiency, highlighting the severity of
these privacy risks and the urgent need for robust defenses.
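To illustrate the kind of scoring a PRISM-style evaluation implies, the sketch below computes per-attribute inference accuracy against synthetic ground-truth profiles. The attribute names, profiles, and model predictions are fabricated for illustration.

```python
# Sketch of per-attribute scoring for privacy-attribute inference: compare a model's
# inferred attributes against synthetic ground truth. All data below are fabricated.
from collections import defaultdict


def attribute_accuracy(profiles: dict, predictions: dict) -> dict:
    """Per-attribute accuracy over synthetic profiles (exact string match)."""
    hits, totals = defaultdict(int), defaultdict(int)
    for pid, truth in profiles.items():
        for attr, value in truth.items():
            totals[attr] += 1
            if predictions.get(pid, {}).get(attr) == value:
                hits[attr] += 1
    return {attr: hits[attr] / totals[attr] for attr in totals}


if __name__ == "__main__":
    profiles = {
        "user_001": {"age_range": "25-34", "occupation": "nurse", "location": "Lyon"},
        "user_002": {"age_range": "35-44", "occupation": "teacher", "location": "Porto"},
    }
    predictions = {  # what an evaluated M-LLM might return for each profile
        "user_001": {"age_range": "25-34", "occupation": "nurse", "location": "Paris"},
        "user_002": {"age_range": "35-44", "occupation": "engineer", "location": "Porto"},
    }
    for attr, acc in attribute_accuracy(profiles, predictions).items():
        print(f"{attr}: {acc:.0%}")
```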
The Aula Fellowship
Why we think this paper is great for you:
You will appreciate this paper's examination of power dynamics and decision-making within the AI field. It offers a broader perspective on how AI systems are shaped and governed.
Abstract
This paper examines how decision makers in academia, government, business,
and civil society navigate questions of power in implementations of artificial
intelligence. The study explores how individuals experience and exercise levers
of power, which are presented as social mechanisms that shape institutional
responses to technological change. The study reports on responses to
personalized questionnaires designed to gather insight into a decision maker's
institutional purview, based on an institutional governance framework developed
from the work of Neo-institutionalists. Findings present the anonymized, real
responses and circumstances of respondents in the form of twelve fictional
personas of high-level decision makers from North America and Europe. These
personas illustrate how personal agency, organizational logics, and
institutional infrastructures may intersect in the governance of AI. The
decision makers' responses to the questionnaires then inform a discussion of
the field-level personal power of decision makers, methods of fostering
institutional stability in times of change, and methods of influencing
institutional change in the field of AI. The final section of the discussion
presents a table of the dynamics of the levers of power in the field of AI for
change makers and five testable hypotheses for institutional and social
movement researchers. In summary, this study provides insight into the means for
policymakers within institutions and their counterparts in civil society to
personally engage with AI governance.
German Cancer Research Center (DKFZ)
Why we think this paper is great for you:
While focused on a specific domain, this paper's discussion on the challenges of AI validation practices is broadly applicable to ensuring the reliability and trustworthiness of AI systems. It highlights important considerations for robust AI development.
Abstract
Surgical data science (SDS) is rapidly advancing, yet clinical adoption of
artificial intelligence (AI) in surgery remains severely limited, with
inadequate validation emerging as a key obstacle. In fact, existing validation
practices often neglect the temporal and hierarchical structure of
intraoperative videos, producing misleading, unstable, or clinically irrelevant
results. In a pioneering, consensus-driven effort, we introduce the first
comprehensive catalog of validation pitfalls in AI-based surgical video
analysis that was derived from a multi-stage Delphi process with 91
international experts. The collected pitfalls span three categories: (1) data
(e.g., incomplete annotation, spurious correlations), (2) metric selection and
configuration (e.g., neglect of temporal stability, mismatch with clinical
needs), and (3) aggregation and reporting (e.g., clinically uninformative
aggregation, failure to account for frame dependencies in hierarchical data
structures). A systematic review of surgical AI papers reveals that these
pitfalls are widespread in current practice, with the majority of studies
failing to account for temporal dynamics or hierarchical data structure, or
relying on clinically uninformative metrics. Experiments on real surgical video
datasets provide the first empirical evidence that ignoring temporal and
hierarchical data structures can lead to drastic understatement of uncertainty,
obscure critical failure modes, and even alter algorithm rankings. This work
establishes a framework for the rigorous validation of surgical video analysis
algorithms, providing a foundation for safe clinical translation, benchmarking,
regulatory review, and future reporting standards in the field.
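To see why ignoring temporal and hierarchical structure understates uncertainty, consider this toy simulation: a frame-level bootstrap that treats correlated frames as independent yields a much narrower confidence interval than a video-level (cluster) bootstrap. All numbers are simulated and are not from the paper's experiments.

```python
# Toy illustration: frame-level bootstrap ignores within-video correlation and
# understates uncertainty; video-level (cluster) bootstrap resamples whole videos.
import numpy as np

rng = np.random.default_rng(0)
n_videos, frames_per_video = 20, 200

# Simulate per-frame accuracy: each video has its own difficulty level (cluster effect).
video_effect = rng.normal(0.80, 0.10, size=n_videos)
scores = np.clip(
    video_effect[:, None] + rng.normal(0, 0.05, size=(n_videos, frames_per_video)), 0, 1
)


def bootstrap_ci(stat_samples):
    """95% percentile confidence interval from bootstrap replicates."""
    return np.percentile(stat_samples, [2.5, 97.5])


flat = scores.ravel()
frame_level = [rng.choice(flat, size=flat.size, replace=True).mean() for _ in range(2000)]
video_level = [
    scores[rng.integers(0, n_videos, size=n_videos)].mean() for _ in range(2000)
]

print("mean accuracy:", round(flat.mean(), 3))
print("frame-level 95% CI:", np.round(bootstrap_ci(frame_level), 3))  # too narrow
print("video-level 95% CI:", np.round(bootstrap_ci(video_level), 3))  # wider, honest
```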