Papers from 15 to 19 September, 2025

Here are the personalized paper recommendations, sorted by relevance.
Functional Programming
Abstract
Even with the increase in popularity of functional programming, imperative programming remains a key programming paradigm, especially for programs operating at lower levels of abstraction. When such software provides key components of a Trusted Computing Base (TCB), e.g. an operating system kernel, it becomes desirable to provide mathematical correctness proofs. However, current real-world imperative programming languages possess "expressive", i.e. overly permissive, semantics. Thus, producing correctness proofs of such programs becomes tedious and error-prone, requiring one to take care of numerous "administrative" details. Ideally, a proof-oriented imperative language should feature well-behaved semantics while allowing imperative idioms. To obtain a high degree of confidence in the correctness of such a language, its tools should be developed inside a proof assistant so that program proofs are machine-checked. We present GallinaC, a shallow embedding of a Turing-complete imperative language directly inside the functional programming language of the Rocq proof assistant, Gallina. In particular, it features a truly generic and unbounded while loop. Having a functional core means proofs about GallinaC programs may use the same tactics as proofs about pure functional ones. Work on GallinaC is still in progress, but we present first promising results. A prototype implementation has shown the viability of GallinaC with the correctness proof of a list reversal procedure for linked lists of unknown size. We currently focus on the forward simulation between the GallinaC intermediate representation (IR) and Cminor, the entry language of the CompCert back-end.
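GallinaC itself lives inside Gallina and the Rocq proof assistant, but the core idea of a shallow embedding, imperative idioms written as state-transforming functions in a functional host with a generic while combinator, can be pictured with a minimal hypothetical sketch. The Python below is purely illustrative: the names while_ and reverse_program are invented, and the sketch ignores the totality and proof concerns that make the Gallina setting interesting.

# Hypothetical sketch of a shallow embedding of imperative control flow
# in a functional host: statements are functions from state to state.
# GallinaC is defined in Gallina/Rocq; its actual combinators differ.

def while_(cond, body):
    """Generic while loop: iterate `body` on the state while `cond` holds."""
    def run(state):
        while cond(state):
            state = body(state)
        return state
    return run

# List reversal phrased as a state transformer over (todo, acc),
# loosely mirroring the linked-list reversal mentioned in the abstract.
def reverse_program():
    def cond(state):
        todo, _acc = state
        return len(todo) > 0

    def body(state):
        todo, acc = state
        return (todo[1:], [todo[0]] + acc)

    return while_(cond, body)

if __name__ == "__main__":
    prog = reverse_program()
    _todo, acc = prog(([1, 2, 3, 4], []))
    print(acc)  # [4, 3, 2, 1]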
Abstract
Large Language Models (LLMs) have made remarkable progress in mathematical reasoning, but they continue to struggle with high-precision tasks like numerical computation and formal symbolic manipulation. Integrating external tools has emerged as a promising approach to bridge this gap. Despite recent advances, existing methods struggle with three key challenges: constructing tool-integrated reasoning data, performing fine-grained optimization, and enhancing inference. To overcome these limitations, we propose THOR (Tool-Integrated Hierarchical Optimization via RL). First, we introduce TIRGen, a multi-agent actor-critic-based pipeline for constructing high-quality datasets of tool-integrated reasoning paths that align with the policy and generalize well across diverse models. Second, to perform fine-grained hierarchical optimization, we introduce an RL strategy that jointly optimizes both trajectory-level problem solving and step-level code generation. This is motivated by our key insight that the success of an intermediate tool call is a strong predictor of the final answer's correctness. Finally, THOR incorporates a self-correction mechanism that leverages immediate tool feedback to dynamically revise erroneous reasoning paths during inference. Our approach demonstrates strong generalization across diverse models, performing effectively in both reasoning and non-reasoning models. It further achieves state-of-the-art performance among models of a similar scale on multiple mathematical benchmarks, while also delivering consistent improvements on code benchmarks. Our code will be publicly available at https://github.com/JingMog/THOR.
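The self-correction mechanism described above, revising a reasoning path when an intermediate tool call fails, can be pictured roughly as the loop below. This is a hypothetical sketch, not THOR's implementation: generate_step, run_tool, and revise_step are stand-ins for the model's step generator, the external tool (e.g. a code interpreter), and the revision prompt.

# Hypothetical sketch of tool-integrated inference with self-correction:
# each reasoning step may emit code, the tool's feedback is checked, and
# failing steps are revised before the trajectory continues.
# generate_step, run_tool, and revise_step are stand-ins, not THOR's API.

def solve(problem, generate_step, run_tool, revise_step,
          max_steps=10, max_retries=2):
    trajectory = [problem]
    for _ in range(max_steps):
        step = generate_step(trajectory)           # text and/or a code snippet
        if step.get("code"):
            ok, feedback = run_tool(step["code"])  # execute the tool call
            retries = 0
            while not ok and retries < max_retries:
                # Feed the tool's error message back and regenerate the step.
                step = revise_step(trajectory, step, feedback)
                ok, feedback = run_tool(step["code"])
                retries += 1
            step["tool_output"] = feedback
        trajectory.append(step)
        if step.get("final_answer") is not None:
            return step["final_answer"], trajectory
    return None, trajectory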
Object Oriented Programming
Abstract
A key part of any dependent type checker is the method for checking whether two types are equal. A common claim is that syntax-directed equality is more performant, while type-directed equality is more expressive. However, this claim is difficult to make precise, since implementations choose only one or the other approach, making a direct comparison impossible. We present work in progress on a realistic platform for a direct, apples-to-apples comparison of the two approaches, quantifying how much slower type-directed equality checking is, and analyzing why and how it can be improved.
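To make the distinction concrete, here is a small hypothetical sketch contrasting the two styles for a toy language with functions and a unit type: syntax-directed equality compares terms structurally, while type-directed equality lets the type drive the comparison (eta-expanding at function type, identifying everything at unit type). The term encoding and both checkers are invented for illustration and are not the paper's benchmark platform.

# Toy illustration (not the paper's platform). Terms are tuples:
#   ("var", name), ("lam", name, body), ("app", fun, arg)
# Types are "unit" or ("arrow", a, b).

def syn_eq(t, u):
    """Syntax-directed: purely structural comparison of (normal) terms."""
    return t == u

def type_eq(t, u, ty):
    """Type-directed: the type decides how the terms are compared."""
    if ty == "unit":
        return True                      # eta for unit: all inhabitants equal
    if isinstance(ty, tuple) and ty[0] == "arrow":
        _, _a, b = ty
        # Eta for functions: compare both sides applied to a fresh variable.
        fresh = ("var", "x")
        return type_eq(("app", t, fresh), ("app", u, fresh), b)
    return syn_eq(t, u)                  # base types: fall back to structure

# A function and its eta-expansion differ syntactically but are identified
# at arrow type by the type-directed checker.
f = ("var", "f")
eta_f = ("lam", "y", ("app", ("var", "f"), ("var", "y")))
print(syn_eq(f, eta_f))                               # False
print(type_eq(f, eta_f, ("arrow", "unit", "unit")))   # True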
Abstract
Currently, many large language models (LLMs) are utilized for software engineering tasks such as code generation. More advanced models, known as large reasoning models (LRMs), such as OpenAI's o3, DeepSeek R1, and Qwen3, have emerged and demonstrated the capability to perform multi-step reasoning. Despite the advancement of LRMs, little attention has been paid to systematically analyzing the reasoning patterns these models exhibit and how such patterns influence the generated code. This paper presents a comprehensive study aimed at investigating and uncovering the reasoning behavior of LRMs during code generation. We prompted several state-of-the-art LRMs of varying sizes with code generation tasks and applied open coding to manually annotate the reasoning traces. From this analysis, we derive a taxonomy of LRM reasoning behaviors, encompassing 15 reasoning actions across four phases. Our empirical study based on the taxonomy reveals a series of findings. First, we identify common reasoning patterns, showing that LRMs generally follow a human-like coding workflow, with more complex tasks eliciting additional actions such as scaffolding, flaw detection, and style checks. Second, we compare reasoning across models, finding that Qwen3 exhibits iterative reasoning while DeepSeek-R1-7B follows a more linear, waterfall-like approach. Third, we analyze the relationship between reasoning and code correctness, showing that actions such as unit test creation and scaffold generation strongly support functional outcomes, with LRMs adapting strategies based on task context. Finally, we evaluate lightweight prompting strategies informed by these findings, demonstrating the potential of context- and reasoning-oriented prompts to improve LRM-generated code. Our results offer insights and practical implications for advancing automatic code generation.
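As a rough illustration of what a reasoning-oriented prompt in this spirit might look like (not the prompts evaluated in the paper), the template below nudges a model through the kinds of actions the abstract highlights, such as scaffolding, unit test creation, and flaw detection.

# Hypothetical reasoning-oriented prompt template, loosely inspired by the
# actions named in the abstract. Not the paper's actual prompting strategy.

REASONING_PROMPT = """You are solving a code generation task.
Before writing the final solution, reason explicitly through these steps:
1. Restate the task and its inputs, outputs, and constraints.
2. Sketch a scaffold: function signatures and key data structures.
3. Write a few unit tests that the solution must pass.
4. Implement the solution.
5. Check the implementation against the tests and look for flaws
   (edge cases, off-by-one errors, style issues); revise if needed.

Task:
{task_description}
"""

def build_prompt(task_description: str) -> str:
    return REASONING_PROMPT.format(task_description=task_description)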
Programming Language Design
Abstract
Developing editing support for $L$ languages in $E$ editors is complex and time-consuming. Some languages do not provide dedicated editors, while others offer a single native editor. The $\textit{language server protocol}$ (LSP) reduces the language-editor combinations $L \times E$ to $L + E$, where a single language server communicates with editors via LSP plugins. However, overlapping implementations of linguistic components remain an issue. Existing language workbenches struggle with modularity, reusability, and leveraging type systems for language server generation. In this work, we propose: (i) Typelang, a family of domain-specific languages for modular, composable, and reusable type system implementation, (ii) a modular language server generation process, producing servers for languages built in a modular workbench, (iii) the variant-oriented programming paradigm and a cross-artifact coordination layer to manage interdependent software variants, and (iv) an LSP plugin generator, reducing $E$ to $1$ by automating plugin creation for multiple editors. To simplify editing support for language families, each language artifact integrates its own Typelang variant, used to generate language servers. This reduces combinations to $T \times 1$, where $T = L$ represents the number of type systems. Further reuse of language artifacts across languages lowers this to $N \times 1$, where $N \ll T$, representing unique type systems. We implement Typelang in Neverlang, generating language servers for each artifact and LSP plugins for three editors. Empirical evaluation shows a 93.48% reduction in characters needed for type system implementation and 100% automation of LSP plugin generation, significantly lowering effort for editing support in language families, especially when artifacts are reused.
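For a sense of scale, with purely hypothetical numbers: supporting $L = 10$ languages in $E = 5$ editors natively would require $10 \times 5 = 50$ integrations; LSP brings this down to $10 + 5 = 15$ (one server per language plus one plugin per editor); generating the editor plugins leaves $T \times 1 = 10$ type-system implementations; and if those 10 languages share only $N = 3$ distinct type systems, artifact reuse brings the count down to $3$.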
Delft University of Technology
Abstract
Large language models have shown impressive performance in various domains, including code generation across diverse open-source domains. However, their applicability in proprietary industrial settings, where domain-specific constraints and code interdependencies are prevalent, remains largely unexplored. We present a case study conducted in collaboration with the leveling department at ASML to investigate the performance of LLMs in generating functional, maintainable code within a closed, highly specialized software environment. We developed an evaluation framework tailored to ASML's proprietary codebase and introduced a new benchmark. Additionally, we proposed a new evaluation metric, build@k, to assess whether LLM-generated code successfully compiles and integrates within real industrial repositories. We investigate various prompting techniques, compare the performance of generic and code-specific LLMs, and examine the impact of model size on code generation capabilities, using both match-based and execution-based metrics. The findings reveal that prompting techniques and model size have a significant impact on output quality, with few-shot and chain-of-thought prompting yielding the highest build success rates. The difference in performance between the code-specific LLMs and generic LLMs was less pronounced and varied substantially across different model families.
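The abstract does not spell out how build@k is computed; assuming it parallels the standard unbiased pass@k estimator, with "the sample builds and integrates" replacing "the sample passes the tests", a sketch would look like the following. The function name and this definition are assumptions, not ASML's or the paper's exact formulation.

# Hypothetical sketch of a build@k estimator, assuming it mirrors the
# standard unbiased pass@k estimator with "builds successfully" as the
# success criterion. The paper's exact definition may differ.
from math import comb

def build_at_k(n: int, c: int, k: int) -> float:
    """Estimate P(at least one of k samples builds), given n samples
    of which c built and integrated successfully."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example: 20 generations per task, 5 of which compiled and integrated.
print(round(build_at_k(n=20, c=5, k=1), 3))   # 0.25
print(round(build_at_k(n=20, c=5, k=10), 3))  # ~0.984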
Design Patterns
Abstract
While Large Language Models (LLMs) have demonstrated impressive reasoning and planning abilities in textual domains and can effectively follow instructions for complex tasks, their capacity for spatial understanding and reasoning remains limited. Such capabilities, however, are critical for applications like content-aware graphic layout design, which demands precise placement, alignment, and structural organization of multiple elements within constrained visual spaces. To address this gap, we propose LaySPA, a reinforcement learning-based framework that augments LLM agents with explicit spatial reasoning capabilities. LaySPA leverages hybrid reward signals that capture geometric validity, structural fidelity, and visual quality, enabling agents to model inter-element relationships, navigate the canvas, and optimize spatial arrangements. Through iterative self-exploration and adaptive policy optimization, LaySPA produces both interpretable reasoning traces and structured layouts. Experimental results demonstrate that LaySPA generates structurally sound and visually appealing layouts, outperforming larger general-purpose LLMs and achieving results on par with state-of-the-art specialized layout models.
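The hybrid reward described above can be pictured as a weighted sum of simple geometric checks. The sketch below is a hypothetical illustration, with invented weights and terms and boxes encoded as (x, y, w, h) on a unit canvas; it is not LaySPA's actual reward function.

# Hypothetical sketch of a hybrid layout reward: geometric validity
# (in-bounds, low overlap) plus a crude alignment term standing in for
# structural fidelity / visual quality. Weights and terms are invented.

def overlap_area(a, b):
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    return w * h

def layout_reward(boxes, w_valid=0.4, w_overlap=0.4, w_align=0.2):
    # Geometric validity: every box lies inside the unit canvas.
    valid = all(0 <= x and 0 <= y and x + w <= 1 and y + h <= 1
                for x, y, w, h in boxes)
    validity = 1.0 if valid else 0.0

    # Overlap penalty: total pairwise overlap, clipped to [0, 1].
    total_overlap = sum(overlap_area(boxes[i], boxes[j])
                        for i in range(len(boxes))
                        for j in range(i + 1, len(boxes)))
    overlap_score = max(0.0, 1.0 - total_overlap)

    # Alignment: reward boxes sharing left edges.
    lefts = [round(x, 2) for x, _, _, _ in boxes]
    align_score = max(lefts.count(v) for v in lefts) / len(boxes)

    return (w_valid * validity + w_overlap * overlap_score
            + w_align * align_score)

# Example: two aligned, non-overlapping boxes score 1.0.
print(layout_reward([(0.1, 0.1, 0.3, 0.2), (0.1, 0.4, 0.3, 0.2)]))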
Programming Paradigms
Abstract
Operations research (OR) provides fundamental methodologies for complex system decision-making, with established applications in transportation, supply chain management, and production scheduling. Traditional approaches, which depend on expert-based modeling and manual parameter adjustment, often face challenges in handling large-scale, dynamic, and multi-constraint problems. Recently, large language models (LLMs) have shown potential to address these limitations through semantic understanding, structured generation, and reasoning control. LLMs can translate natural language descriptions into mathematical models or executable code, generate heuristics, evolve algorithms, and directly tackle optimization tasks. This paper surveys recent progress on the integration of LLMs into OR, organizing methods into three main directions: automatic modeling, auxiliary optimization, and direct solving. It further reviews evaluation benchmarks and domain-specific applications, and summarizes key open issues such as unstable semantic-to-structure mapping, fragmented research progress, limited generalization, and insufficient evaluation systems. Finally, the survey outlines possible research avenues for advancing the role of LLMs in OR.
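The "automatic modeling" direction targets pipelines in which a natural-language description is turned into a solvable mathematical model or executable code. As a hypothetical end-to-end illustration of what such a pipeline's output might look like (not taken from any surveyed system), here is a toy production-planning problem written as a linear program and solved with scipy.

# Hypothetical target of an "automatic modeling" pipeline: a toy
# production-planning LP solved directly. Problem and numbers are invented.
#
#   maximize  3*x1 + 5*x2          (profit per unit of products 1 and 2)
#   s.t.      x1 + 2*x2 <= 14      (machine hours)
#             3*x1 - x2 >= 0       (product-mix constraint)
#             x1 - x2   <= 2
#             x1, x2    >= 0
from scipy.optimize import linprog

# linprog minimizes, so the objective is negated; the ">=" row is negated
# to fit the A_ub @ x <= b_ub form.
res = linprog(
    c=[-3, -5],
    A_ub=[[1, 2], [-3, 1], [1, -1]],
    b_ub=[14, 0, 2],
    bounds=[(0, None), (0, None)],
)
print(res.x, -res.fun)  # optimal plan [6, 4] with profit 38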