Abstract
Developing editing support for $L$ languages in $E$ editors is complex and
time-consuming. Some languages do not provide dedicated editors, while others
offer a single native editor. The $\textit{Language Server Protocol}$ (LSP)
reduces the language-editor combinations $L \times E$ to $L + E$, where a
single language server communicates with editors via LSP plugins. However,
overlapping implementations of linguistic components remain an issue. Existing
language workbenches struggle with modularity, reusability, and leveraging type
systems for language server generation. In this work, we propose: (i) Typelang,
a family of domain-specific languages for modular, composable, and reusable
type system implementation, (ii) a modular language server generation process,
producing servers for languages built in a modular workbench, (iii) the
variant-oriented programming paradigm and a cross-artifact coordination layer
to manage interdependent software variants, and (iv) an LSP plugin generator,
reducing $E$ to $1$ by automating plugin creation for multiple editors. To
simplify editing support for language families, each language artifact
integrates its own Typelang variant, used to generate language servers. This
reduces combinations to $T \times 1$, where $T = L$ represents the number of
type systems. Further reuse of language artifacts across languages lowers this
to $N \times 1$, where $N \ll T$ is the number of unique type systems. We implement
Typelang in Neverlang, generating language servers for each artifact and LSP
plugins for three editors. Empirical evaluation shows a 93.48% reduction in
characters needed for type system implementation and 100% automation of LSP
plugin generation, significantly lowering effort for editing support in
language families, especially when artifacts are reused.
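To make the arithmetic behind these reductions concrete, consider a hypothetical family of $L = 5$ languages targeting $E = 3$ editors (the counts below are illustrative, not drawn from the evaluation):
\[
\underbrace{L \times E = 15}_{\text{per-editor support}}
\;\longrightarrow\;
\underbrace{L + E = 8}_{\text{LSP}}
\;\longrightarrow\;
\underbrace{T \times 1 = 5}_{\text{generated servers and plugins},\; T = L}
\;\longrightarrow\;
\underbrace{N \times 1 = 2}_{\text{artifact reuse},\; N \ll T}.
\]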
Abstract
Large language models (LLMs) have shown impressive performance in various
domains, including code generation on diverse open-source codebases. However, their
applicability in proprietary industrial settings, where domain-specific
constraints and code interdependencies are prevalent, remains largely
unexplored. We present a case study conducted in collaboration with the
leveling department at ASML to investigate the performance of LLMs in
generating functional, maintainable code within a closed, highly specialized
software environment.
We develop an evaluation framework tailored to ASML's proprietary codebase
and introduce a new benchmark. Additionally, we propose a new evaluation
metric, build@k, to assess whether LLM-generated code successfully compiles and
integrates within real industrial repositories. We investigate various
prompting techniques, compare the performance of generic and code-specific
LLMs, and examine the impact of model size on code generation capabilities,
using both match-based and execution-based metrics. The findings reveal that
prompting techniques and model size have a significant impact on output
quality, with few-shot and chain-of-thought prompting yielding the highest
build success rates. The performance gap between code-specific and generic
LLMs is less pronounced and varies substantially across model families.
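The abstract does not define build@k formally. A plausible formulation, assuming it parallels the standard pass@k estimator with compilation-and-integration success substituted for test success (an assumption for illustration, not the paper's definition), is:
\[
\text{build@}k \;=\; \mathbb{E}_{\text{tasks}}\!\left[\, 1 - \frac{\binom{n-b}{k}}{\binom{n}{k}} \,\right],
\]
where $n$ is the number of samples generated per task and $b$ is the number of those samples that compile and integrate successfully within the target repository.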