Hi!

Your personalized paper recommendations for 24–28 November 2025.
🎯 Top Personalized Recommendations
AI Summary
  • Cohesion refers to the degree to which a class's methods are related and work together to achieve a common goal (see the sketch after this list). [3]
  • Coupling refers to the degree to which two or more classes depend on each other. [3]
  • Distortion refers to the presence of unrelated distractor classes in the code, making it harder for models to reason about inter-module relationships. [3]
  • For both concepts, performance drops significantly as tasks shift from verification to open-ended generation. [3]
  • The models' understanding of cohesion is more resilient to noise than their understanding of coupling. [3]
  • The study found that language models are better at recognizing principles (scores >0.80) but struggle with autonomous generation. [1]
  • The findings suggest that models are better suited to recognition-based tasks than to autonomous generation tasks. [3]
  • The study highlights the limitations of language models in software design and development tasks, particularly when faced with complex or noisy contexts. [2]
  • The study only considers a limited set of code transformations and distortion levels. [3]
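To make the cohesion and coupling definitions above concrete, here is a minimal Python sketch of our own (all class and method names are invented for illustration, not taken from the paper): ReportManager bundles unrelated responsibilities (low cohesion), while Invoice reads Customer's internals directly (tight coupling).

```python
# Hypothetical illustration of low cohesion and tight coupling.
# All names are invented for this sketch; none come from the paper.

class ReportManager:
    """Low cohesion: formatting, persistence, and emailing are
    unrelated responsibilities bundled into one class."""

    def format_report(self, data: dict) -> str:
        return "\n".join(f"{key}: {value}" for key, value in data.items())

    def save_to_disk(self, text: str, path: str) -> None:
        with open(path, "w") as f:
            f.write(text)

    def send_email(self, text: str, recipient: str) -> None:
        print(f"Sending to {recipient}: {text[:40]}...")


class Customer:
    def __init__(self, name: str, discount_rate: float):
        self.name = name
        self.discount_rate = discount_rate  # internal detail, exposed


class Invoice:
    """Tight coupling: Invoice depends on Customer's internal
    representation instead of asking Customer for a price."""

    def total(self, customer: Customer, base_price: float) -> float:
        return base_price * (1 - customer.discount_rate)
```

A more cohesive design would split ReportManager into a formatter, a store, and a mailer; looser coupling would let Customer compute its own discounted price.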
Abstract
Large language models (LLMs) are being increasingly adopted in the software engineering domain, yet the robustness of their grasp on core software design concepts remains unclear. We conduct an empirical study to systematically evaluate their understanding of cohesion (intra-module) and coupling (inter-module). We programmatically generate poorly designed code fragments and test the DeepSeek-R1 model family (14B, 32B, 70B) under varying levels of guidance, from simple Verification to Guided and Open-ended Generation, while varying contextual noise by injecting distractor elements. While models exhibit a solid baseline understanding of both concepts in ideal conditions, their practical knowledge is fragile and highly asymmetrical. Reasoning about coupling proves brittle; performance collapses in noisy, open-ended scenarios, with F1 scores dropping by over 50%. In contrast, the models' analysis of cohesion is remarkably robust to internal noise in guided tasks, showing little performance degradation. However, this resilience also fails when all guidance is removed. Reasoning-trace analysis confirms these failure modes, revealing cognitive shortcutting for coupling versus a more exhaustive (yet still failing) analysis for cohesion. To summarize, while LLMs can provide reliable assistance for recognizing design flaws, their ability to reason autonomously in noisy, realistic contexts is limited, highlighting the critical need for more scalable and robust program understanding capabilities.
Why we think this paper is great for you:
This paper directly investigates core software design principles like cohesion and coupling, offering valuable insights into the foundational elements of robust software architecture. It will deepen your understanding of effective design practices.
Abstract
Given a language, which in this article is a set of strings of some fixed length, we study the problem of producing its elements by a procedure in which each position has its own local rule. We introduce a way of measuring how much communication is needed between positions. The communication structure is captured by a simplicial complex whose vertices are the positions and the simplices are the communication channels between positions. The main problem is then to identify the simplicial complexes that can be used to generate a given language. We develop the theory and apply it to a number of languages.
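As a toy illustration of the communication-free extreme (our own sketch, not an example from the paper): if each position has an independent local rule and there are no communication channels, the generated language is exactly the Cartesian product of its per-position alphabets. The check below tests that condition for a given fixed-length language.

```python
from itertools import product

def is_product_language(language: set[str]) -> bool:
    """Return True iff `language` (a set of equal-length strings) can be
    generated by a fully independent local rule at each position, i.e.
    with no communication between positions. This holds exactly when the
    language equals the product of its per-position projections."""
    if not language:
        return True
    n = len(next(iter(language)))
    assert all(len(word) == n for word in language)
    # Per-position projections: the set of symbols seen at each position.
    alphabets = [{word[i] for word in language} for i in range(n)]
    rebuilt = {"".join(symbols) for symbols in product(*alphabets)}
    return rebuilt == language

# All 2-bit strings: independent positions suffice.
print(is_product_language({"00", "01", "10", "11"}))  # True
# Even-parity strings: the two positions must communicate.
print(is_product_language({"00", "11"}))              # False
```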
Why we think this paper is great for you:
You'll find this paper highly relevant as it explores the fundamental mechanisms behind language generation, which is key to understanding the principles of programming language design. It offers a unique perspective on how languages are structured.
AI Summary
  • The problem is to enable an autonomous system to learn what it means to perform a concept by observing multiple instances of that concept. [2]
  • A proximal policy optimization (PPO) algorithm was used to train a computer agent through reinforcement learning, producing 284 episodes of the agent playing the game. [3]
Abstract
A wiring diagram is a labeled directed graph that represents an abstract concept such as a temporal process. In this article, we introduce the notion of a quasi-skeleton wiring diagram graph, and prove that quasi-skeleton wiring diagram graphs correspond to Hasse diagrams. Using this result, we designed algorithms that extract wiring diagrams from sequential data. We used our algorithms in analyzing the behavior of an autonomous agent playing a computer game, and the algorithms correctly identified the winning strategies. We compared the performance of our main algorithm with two other algorithms based on standard clustering techniques (DBSCAN and agglomerative hierarchical), including when some of the data was perturbed. Overall, this article brings together techniques in category theory, graph theory, clustering, reinforcement learning, and data engineering.
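To give a feel for the pipeline (a rough sketch under our own simplifying assumptions, not the paper's actual algorithm): infer a precedence relation from observed event sequences, then take the transitive reduction of that relation, which is its Hasse diagram. The event names below are invented.

```python
from itertools import combinations

def hasse_from_sequences(sequences: list[list[str]]) -> set[tuple[str, str]]:
    """Toy reconstruction: event a precedes event b if a appears before b
    in every observed sequence containing both; the Hasse diagram is the
    transitive reduction of that partial order."""
    events = {e for seq in sequences for e in seq}
    order = set()
    for a, b in combinations(sorted(events), 2):
        positions = [(seq.index(a), seq.index(b))
                     for seq in sequences if a in seq and b in seq]
        if positions and all(i < j for i, j in positions):
            order.add((a, b))
        elif positions and all(j < i for i, j in positions):
            order.add((b, a))
    # Transitive reduction: drop (a, b) when some c satisfies a < c < b.
    return {(a, b) for (a, b) in order
            if not any((a, c) in order and (c, b) in order for c in events)}

print(hasse_from_sequences([["gather", "build", "attack"],
                            ["gather", "build", "defend", "attack"]]))
# {('gather', 'build'), ('build', 'defend'), ('defend', 'attack')}
# (set printing order may vary)
```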
Why we think this paper is great for you:
This paper delves into representing abstract concepts using wiring diagrams, which could provide an interesting conceptual framework for modeling and understanding different programming paradigms. It offers a novel way to visualize complex system structures.
AI Summary
  • The paper studies the convergence rate of an algorithm for solving optimization problems with nonlinear constraints. [2]
  • Notation: η_t is the step size at iteration t; δ_t is the expected squared distance between the iterate and the optimum at iteration t; ∆_t is the optimality gap at iteration t; w_T is the maximum constraint violation at iteration T; κ is the ratio of L to µ (the condition number); σ is the noise variance. The algorithm achieves optimal convergence rates in terms of SFO/QMO complexity for both the convex and strongly convex cases. [3]
Abstract
Stochastic convex optimization problems with nonlinear functional constraints are ubiquitous in machine learning applications, including multi-task learning, structured prediction, and multi-view learning. The presence of nonlinear functional constraints renders the traditional projected stochastic gradient descent and related projection-based methods inefficient, and motivates the use of first-order methods. However, existing first-order methods, including primal and primal-dual algorithms, typically rely on a bounded (sub-)gradient assumption, which may be too restrictive in many settings. We propose a stochastic sequential quadratic programming (SSQP) algorithm that works entirely in the primal domain, avoids projecting onto the feasible region, obviates the need for bounded gradients, and achieves state-of-the-art oracle complexity under standard smoothness and convexity assumptions. A faster version, namely SSQP-Skip, is also proposed where the quadratic subproblems can be skipped in most iterations. Finally, we develop an accelerated variance-reduced version of SSQP (VARAS), whose oracle complexity bounds match those for solving unconstrained finite-sum convex optimization problems. The superior performance of the proposed algorithms is demonstrated via numerical experiments on real datasets.
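For context, a generic stochastic SQP step linearizes the constraints around the current iterate and solves a step-regularized quadratic subproblem. The template below is the standard form only; the paper's exact subproblem may differ.

```latex
% Generic SQP subproblem at iterate x_t (standard template, not
% necessarily the paper's exact formulation). Here g_t is a stochastic
% gradient estimate of the objective, c_i are the nonlinear constraints,
% and \eta_t is the step size.
d_t \in \arg\min_{d}\; g_t^{\top} d + \frac{1}{2\eta_t}\,\|d\|^{2}
\quad \text{s.t.} \quad
c_i(x_t) + \nabla c_i(x_t)^{\top} d \le 0, \quad i = 1, \dots, m,
\qquad x_{t+1} = x_t + d_t.
```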
Why we think this paper is great for you:
While the title mentions "functional constraints," this paper focuses on optimization techniques in machine learning, which is a different domain from your primary interests in programming paradigms. It's less directly aligned with your core areas.
AI Summary
  • The field of user interface (UI) design is rapidly evolving with the advent of artificial intelligence (AI) and machine learning (ML). [2]
  • Machine Learning (ML): a subset of AI that involves training algorithms to learn from data and improve their performance over time. [3]
  • Lack of transparency and explainability in AI decision-making processes. [3]
Abstract
Generative models, such as large language models and text-to-image diffusion models, are increasingly used to create visual designs like user interfaces (UIs) and presentation slides. Finetuning and benchmarking these generative models have often relied on datasets of human-annotated design preferences. Yet, due to the subjective and highly personalized nature of visual design, preference varies widely among individuals. In this paper, we study this problem by introducing DesignPref, a dataset of 12k pairwise comparisons of UI design generation annotated by 20 professional designers with multi-level preference ratings. We found that among trained designers, substantial levels of disagreement exist (Krippendorff's alpha = 0.25 for binary preferences). Natural language rationales provided by these designers indicate that disagreements stem from differing perceptions of various design aspect importance and individual preferences. With DesignPref, we demonstrate that traditional majority-voting methods for training aggregated judge models often do not accurately reflect individual preferences. To address this challenge, we investigate multiple personalization strategies, particularly fine-tuning or incorporating designer-specific annotations into RAG pipelines. Our results show that personalized models consistently outperform aggregated baseline models in predicting individual designers' preferences, even when using 20 times fewer examples. Our work provides the first dataset to study personalized visual design evaluation and support future research into modeling individual design taste.
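To see why majority voting can misrepresent individual taste (a synthetic toy example with made-up annotations, not the DesignPref data): with 20 annotators, the aggregated label can disagree with a sizable minority on every item.

```python
from collections import Counter

# Synthetic pairwise annotations: votes[item][designer] is the preferred
# design, "A" or "B". The numbers are invented; this is not DesignPref.
votes = {
    "ui_pair_1": {f"designer_{i}": ("A" if i < 12 else "B") for i in range(20)},
    "ui_pair_2": {f"designer_{i}": ("B" if i % 3 else "A") for i in range(20)},
}

def majority_label(item_votes: dict[str, str]) -> str:
    """Aggregate annotations by simple majority vote."""
    return Counter(item_votes.values()).most_common(1)[0][0]

for item, item_votes in votes.items():
    label = majority_label(item_votes)
    disagreeing = sum(vote != label for vote in item_votes.values())
    print(f"{item}: majority={label}, "
          f"{disagreeing}/20 designers disagree with the aggregate")
```

A judge model fit to each designer's own labels targets exactly this gap, which is what the paper's fine-tuned and RAG-based personalization strategies aim to close.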
Why we think this paper is great for you:
This paper explores visual design generation, focusing on user interfaces and aesthetic preferences, which is distinct from your interests in software design patterns or programming language design. It falls outside your primary research focus.

Interests not found

We did not find any papers that match the interests below. Try other terms, and consider whether the content exists on arxiv.org.
  • Programming Paradigms
  • Object Oriented Programming
You can edit or add more interests any time.