Hi!

Your personalized paper recommendations for 01 to 05 December, 2025.
Functional Programming
Georgia State University
Abstract
The functional programming paradigm has a long and storied history, with its beginnings in the Lambda Calculus. In recent decades, pure functional languages such as Haskell have been shown to be highly effective in producing robust software due to immutable data structures, among other functional features. The advantages of programming with immutable data structures are also available in non-functional languages such as Python. Over the years, non-functional languages have introduced immutable data structures as well as comprehension and lambda expressions, making it possible to program in a purely functional style in them. In this paper, we present a "best practice" idea for introductory programming classes that forces students to learn and complete programming assignments in a purely functional subset of Python. By doing so, students can learn functional ideas such as immutability, pure functions with no side effects, and stateless programming. We define a functional subset of Python and illustrate the best practice using small examples. We strongly feel that students in computing need familiarity with pure functional programming and argue that this can be taught in introductory programming courses that use Python.
AI Summary
  • By teaching functional programming early on, students can develop good coding habits and produce more expressive and high-level code. [3]
  • The paper argues that incorporating functional programming in CS1 and CS2 courses can help students become better programmers and improve their problem-solving skills. [3]
  • Functional programming paradigm: a programming style that emphasizes the use of pure functions, immutability, and declarative constructs to produce robust software systems. [3]
  • Higher-order functions: functions that take other functions as arguments or return functions as output, enabling more expressive and high-level code. [3]
  • Pure functions: functions with no side effects, producing the same output for a given input every time, making them easier to test and debug (see the Python sketch after this list). [3]
  • The best practice introduced in this paper has been successfully implemented in a CS2 course, generating positive outcomes and student engagement. [3]
  • The approach emphasizes the use of higher-order functions, lambda expressions, and immutable data structures to promote robust and reliable software systems. [2]
  • The paper introduces a 'best practice' idea for teaching functional programming in introductory computer science courses using Python. [1]
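A minimal sketch of the purely functional Python style these bullets describe (the specific functions are illustrative, not taken from the paper): tuples stand in for immutable data, while map, filter, reduce, lambdas, and comprehensions replace loops and mutable state.

```python
from functools import reduce

# Pure function: no side effects, same output for the same input.
def square(x: int) -> int:
    return x * x

# Immutable data: a tuple instead of a list; new values instead of mutation.
numbers = (1, 2, 3, 4, 5)

# Higher-order functions and lambda expressions instead of loops and state.
squares = tuple(map(square, numbers))
evens = tuple(filter(lambda n: n % 2 == 0, numbers))
total = reduce(lambda acc, n: acc + n, numbers, 0)

# Comprehensions build new collections without mutating their inputs.
squared_evens = tuple(n * n for n in numbers if n % 2 == 0)

assert squares == (1, 4, 9, 16, 25)
assert evens == (2, 4)
assert total == 15
assert squared_evens == (4, 16)
```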
Dartmouth
Abstract
The adequate use of information measured continuously over a period of time poses a methodological challenge. In recent decades, most traditional statistical procedures have been extended to accommodate such functional data. The binary classification problem, which aims to correctly identify units as positive or negative based on marker values, is no exception. The crucial point for making binary classifications based on a marker is to establish an order on the marker values, which is not immediate when these values are presented as functions. Here, we argue that if the marker is related to the characteristic under study, a trajectory from a positive participant should be more similar to trajectories from the positive population than to those drawn from the negative one. With this criterion, a classification procedure based on the distance between the involved functions is proposed. In addition, we propose a fully non-parametric estimator for this so-called probability-based criterion, PBC. We explore its asymptotic properties and its finite-sample behavior through an extensive Monte Carlo study. The observed results suggest that the proposed methodology works adequately, and frequently better than its competitors, for a wide variety of situations when the sample size in both the training and the testing cohorts is adequate. The practical use of the proposal is illustrated with a real-world dataset. As online supplementary material, the manuscript includes a document with further simulations and additional comments. An R function that wraps up the implemented routines is also provided.
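A minimal sketch of the distance-based idea in this abstract, under two assumptions not stated in the paper: curves are sampled on a common grid, and plain L2 distance stands in for the functional metric. This is not the paper's PBC estimator, only the intuition behind it.

```python
import numpy as np

def classify_trajectory(new_curve, positive_curves, negative_curves):
    """Classify a trajectory by whether it is, on average, closer to the
    positive training trajectories than to the negative ones."""
    dist_pos = np.mean([np.linalg.norm(new_curve - c) for c in positive_curves])
    dist_neg = np.mean([np.linalg.norm(new_curve - c) for c in negative_curves])
    return "positive" if dist_pos < dist_neg else "negative"

# Tiny synthetic example: positive trajectories drift upward, negatives stay flat.
rng = np.random.default_rng(0)
grid = np.linspace(0, 1, 50)
positives = [grid + rng.normal(0, 0.1, 50) for _ in range(20)]
negatives = [rng.normal(0, 0.1, 50) for _ in range(20)]
print(classify_trajectory(grid + rng.normal(0, 0.1, 50), positives, negatives))  # positive
```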
AI Summary
  • Functional data analysis is a statistical approach that deals with data that can be represented as functions. [3]
  • The receiver operating characteristic (ROC) curve is a graphical representation of the performance of a binary classifier. [3]
  • The area under the ROC curve (AUC) is a widely used metric for evaluating the performance of a binary classifier (a minimal computation is sketched after this list). [3]
  • A new method called the generalized ROC (gROC) curve has been proposed to handle markers whose relationship with the condition is non-monotone. [3]
  • The gROC curve can be estimated using a two-stage approach, where the first stage involves estimating the eCDFs and the second stage involves estimating the ROC curve. [3]
  • The results showed that the gROC curve estimator performed well in terms of accuracy and precision compared to other existing methods. [3]
  • The gROC curve can be used as a tool for evaluating the performance of binary classifiers, especially when the relationship between the marker and the condition is non-monotone. [3]
  • A new R package called logitFD has been developed to implement functional principal component logit regression. [3]
  • The logitFD package provides functions for estimating the eCDFs and the ROC curve using a two-stage approach. [3]
  • A simulation study was conducted to evaluate the performance of the gROC curve estimator compared to other existing methods. [1]
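As a reminder of what the (standard, not generalized) AUC in these bullets measures, here is its empirical form: the probability that a randomly chosen positive scores above a randomly chosen negative, with ties counted as one half. The scores below are made up.

```python
import numpy as np

def empirical_auc(pos_scores, neg_scores):
    """AUC as P(positive score > negative score), counting ties as 1/2."""
    pos = np.asarray(pos_scores, dtype=float)[:, None]
    neg = np.asarray(neg_scores, dtype=float)[None, :]
    wins = (pos > neg).sum() + 0.5 * (pos == neg).sum()
    return wins / (pos.size * neg.size)

print(empirical_auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))  # 8 of 9 pairs -> 0.888...
```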
Programming Language Design
Rochester Institute of Technology
Abstract
Effective text generation and chat interfaces for low-resource languages (LRLs) remain a challenge for state-of-the-art large language models (LLMs). This is mainly due to the difficulty of curating high-quality instruction datasets for LRLs, a limitation prevalent in the languages spoken across the African continent and other regions. Current approaches, such as automated translation and synthetic data generation, frequently yield outputs that lack fluency or even orthographic consistency. In this paper, we introduce InstructLR, a novel framework designed to generate high-quality instruction datasets for LRLs. Our approach integrates LLM-driven text generation with a dual-layer quality filtering mechanism: an automated filtering layer based on retrieval-augmented-generation (RAG)-based n-shot prompting, and a human-in-the-loop validation layer. Drawing inspiration from benchmarks such as MMLU in task definition, InstructLR has facilitated the creation of three multi-domain instruction benchmarks: ZarmaInstruct-50k, BambaraInstruct-50k, and FulfuldeInstruct-50k.
AI Summary
  • The framework leverages previous approaches, including self-instruction and translation, while addressing their limitations through a robust LRL-aware dual-layer quality filtering process. [3]
  • The InstructLR framework is a unified approach for generating quality synthetic instruction data for Low-Resource Languages (LRLs) with minimal human intervention. [2]
  • The system operates through a workflow that includes input text analysis, relevant grammar rule retrieval, context incorporation into a prompt, and structured assessment production (sketched below). [1]
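A schematic sketch of that workflow. Every name here is hypothetical (the paper does not publish this code), the lexical-overlap retriever is a crude stand-in for a real RAG index over LRL grammar references, and the LLM judge is passed in as a callable.

```python
def overlap(rule, text):
    """Crude lexical relevance score, standing in for a real retriever."""
    return len(set(rule.lower().split()) & set(text.lower().split()))

def retrieve_grammar_rules(text, rule_index, k=3):
    """Return the k grammar rules most relevant to the input text."""
    return sorted(rule_index, key=lambda rule: overlap(rule, text), reverse=True)[:k]

def build_filter_prompt(candidate, rules, examples):
    """Fold retrieved rules and n-shot examples into a quality-check prompt."""
    context = "\n".join(f"Rule: {r}" for r in rules)
    shots = "\n".join(f"Example: {e}" for e in examples)
    return (f"{context}\n{shots}\n"
            f"Assess the fluency and orthography of:\n{candidate}\n"
            f"Answer ACCEPT or REJECT with a reason.")

def filter_candidates(candidates, rule_index, examples, llm_judge):
    """Automated filtering layer; survivors go on to human-in-the-loop review."""
    kept = []
    for cand in candidates:
        rules = retrieve_grammar_rules(cand, rule_index)
        verdict = llm_judge(build_filter_prompt(cand, rules, examples))
        if verdict.strip().upper().startswith("ACCEPT"):
            kept.append(cand)
    return kept
```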
Design Patterns
King Abdullah University of Science and Technology
Abstract
Design generation, in its essence, is a step-by-step process where designers progressively refine and enhance their work through careful modifications. Despite this fundamental characteristic, existing approaches mainly treat design synthesis as a single-step generation problem, significantly underestimating the inherent complexity of the creative process. To bridge this gap, we propose a novel problem setting called Step-by-Step Layered Design Generation, which tasks a machine learning model with generating a design that adheres to a sequence of instructions from a designer. Leveraging recent advancements in multi-modal LLMs, we propose SLEDGE: Step-by-step LayEred Design GEnerator to model each update to a design as an atomic, layered change over its previous state, while being grounded in the instruction. To complement our new problem setting, we introduce a new evaluation suite, including a dataset and a benchmark. Our exhaustive experimental analysis and comparison with state-of-the-art approaches tailored to our new setup demonstrate the efficacy of our approach. We hope our work will attract attention to this pragmatic and under-explored research area.
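The problem setting can be pictured with a toy data structure (entirely illustrative; the paper's layer representation and generator are far richer): each instruction produces one atomic new layer over the previous, untouched state.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Layer:
    instruction: str  # the designer's instruction this layer realizes
    content: str      # stand-in for the actual visual payload

@dataclass(frozen=True)
class Design:
    layers: tuple = ()

    def step(self, instruction: str, generator) -> "Design":
        """Apply one instruction as an atomic, layered change; the
        previous design state is preserved, not mutated."""
        new_layer = Layer(instruction, generator(instruction, self.layers))
        return Design(self.layers + (new_layer,))

# A trivial stand-in for the multi-modal LLM generator.
mock_generator = lambda instr, layers: f"rendering of: {instr}"

design = Design()
for instr in ("add a blue background", "place the title top-left", "add a logo"):
    design = design.step(instr, mock_generator)
print([layer.instruction for layer in design.layers])
```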
AI Summary
  • Training, per the supplementary material, proceeds in three steps: (1) training the decoder module, (2) training the model for 12,000 steps, and (3) reducing the batch size to 4 per GPU. The supplementary document provides additional information that could not be included in the main paper due to space constraints. [3]
  • The supplementary document expands on the main paper with details that did not fit in the main text, such as how the model was trained, what was done to improve it, and examples of what it can do. [3]
  • It includes implementation details of the approach, qualitative results comparing with the strongest baseline, circular evaluation and the prompts used for dataset generation and the evaluation protocol, samples from the training dataset, failure cases of the model, and high-resolution samples. [2]
Higharc
Abstract
We introduce Small Building Model (SBM), a Transformer-based architecture for layout synthesis in Building Information Modeling (BIM) scenes. We address the question of how to tokenize buildings by unifying heterogeneous feature sets of architectural elements into sequences while preserving compositional structure. Such feature sets are represented as a sparse attribute-feature matrix that captures room properties. We then design a unified embedding module that learns joint representations of categorical and possibly correlated continuous feature groups. Lastly, we train a single Transformer backbone in two modes: an encoder-only pathway that yields high-fidelity room embeddings, and an encoder-decoder pipeline for autoregressive prediction of room entities, referred to as Data-Driven Entity Prediction (DDEP). Experiments across retrieval and generative layout synthesis show that SBM learns compact room embeddings that reliably cluster by type and topology, enabling strong semantic retrieval. In DDEP mode, SBM produces functionally sound layouts, with fewer collisions and boundary violations and improved navigability.
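A minimal sketch of the unified embedding idea from this abstract: categorical attributes get lookup tables, each continuous feature group gets its own projection, and the results are summed into one room embedding. All dimensions and feature groupings below are invented, not SBM's.

```python
import torch
import torch.nn as nn

class RoomEmbedding(nn.Module):
    """Joint embedding of categorical and grouped continuous room features."""
    def __init__(self, num_room_types, cont_group_sizes, d_model=64):
        super().__init__()
        self.type_emb = nn.Embedding(num_room_types, d_model)
        self.cont_proj = nn.ModuleList(
            [nn.Linear(size, d_model) for size in cont_group_sizes]
        )

    def forward(self, room_type, cont_groups):
        x = self.type_emb(room_type)
        # Each (possibly correlated) continuous group is projected as a unit.
        for proj, feats in zip(self.cont_proj, cont_groups):
            x = x + proj(feats)
        return x

# Hypothetical room: a type id plus two continuous groups (geometry, adjacency stats).
emb = RoomEmbedding(num_room_types=10, cont_group_sizes=[4, 3])
vec = emb(torch.tensor([2]), [torch.randn(1, 4), torch.randn(1, 3)])
print(vec.shape)  # torch.Size([1, 64])
```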
Programming Paradigms
Zhejiang University
Abstract
Multimodal large language models (MLLMs) that think with images can interactively use tools to reason about visual inputs, but current approaches often rely on a narrow set of tools with limited real-world necessity and scalability. In this work, we first reveal a critical and previously overlooked weakness: even state-of-the-art MLLMs are surprisingly brittle, showing significant performance degradation on images with simple orientation changes or natural corruptions, underscoring the need for more robust tool-based reasoning. To address this, we propose CodeVision, a flexible and scalable code-as-tool framework where the model generates code as a universal interface to invoke any image operation, moving beyond fixed tool registries. We train our model using a two-stage methodology, beginning with Supervised Fine-Tuning (SFT) on a high-quality dataset curated for complex, multi-turn tool composition and error recovery, followed by Reinforcement Learning (RL) with a novel and dense process reward function to encourage strategic and efficient tool use. To facilitate this research, we construct new SFT and RL datasets and introduce a challenging new benchmark suite designed to rigorously evaluate robustness to orientation changes and multi-tool reasoning. Experiments on Qwen2.5-VL and Qwen3-VL series show that our approach significantly improves model performance and fosters emergent capabilities such as flexible tool composition, efficient chained execution, and robust error recovery from runtime feedback. Code is available at https://github.com/ByteDance-BandAI/CodeVision.
AI Summary
  • CodeVision is trained using a two-stage process combining Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) with a dense, process-oriented reward function. [3]
  • Model Brittleness: The tendency of models to fail or produce suboptimal results when faced with unexpected inputs or situations. [3]
  • Supervised Fine-Tuning (SFT): a training stage in which the model learns from a curated, high-quality dataset, here one built for complex, multi-turn tool composition and error recovery. [3]
  • Reinforcement Learning (RL): A type of machine learning where an agent learns to take actions in an environment to maximize a reward signal. [3]
  • The CodeVision framework has the potential to create more capable and robust visual agents that can reason and interact with their environment in a more flexible and effective way. [3]
  • The paper proposes a framework called CodeVision that enables visual agents to treat visual interaction as a programming task (sketched below). [2]
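The code-as-tool loop can be caricatured in a few lines (a sketch, not the paper's implementation: the "model output" is a canned string, and a real system would sandbox execution properly rather than just emptying builtins).

```python
from PIL import Image

def run_tool_code(image, code):
    """Execute model-generated code against the image; return the result
    plus any runtime feedback the model could use for error recovery."""
    namespace = {"image": image}
    try:
        exec(code, {"__builtins__": {}}, namespace)
        return namespace["image"], None
    except Exception as err:
        return image, f"runtime error: {err}"

img = Image.new("RGB", (64, 32), "white")
# Stand-in for a model-emitted tool call: fix the image orientation.
generated = "image = image.rotate(90, expand=True)"
img, feedback = run_tool_code(img, generated)
print(img.size, feedback)  # (32, 64) None
```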
Google
Abstract
Large language models (LLMs) have proven to be highly effective for solving complex reasoning tasks. Surprisingly, their capabilities can often be improved by iterating on previously generated solutions. In this context, a reasoning plan for generating and combining a set of solutions can be thought of as an algorithm for reasoning using a probabilistic oracle. We introduce a theoretical framework for analyzing such reasoning algorithms. This framework formalizes the principles underlying popular techniques for iterative improvement and answer aggregation, providing a foundation for designing a new generation of more powerful reasoning methods. Unlike approaches for understanding models that rely on architectural specifics, our model is grounded in experimental evidence. As a result, it offers a general perspective that may extend to a wide range of current and future reasoning oracles.
AI Summary
  • Notation: q is the parameter of the oracle, p the initial success probability, k the number of solutions passed to the oracle, and x the correctness probability of the independent solutions. The maximum achievable success probability is max{p, 2 − 1/q}. [3]
  • The optimal number of independent solutions, each correct with probability x ≥ p, to pass to the oracle A(e_q, p)^d is argmax_{k ∈ ℕ≥0} q^(k−1) (1 − (1 − x)^k). [2]
  • The maximum achievable success probability of an (A(e_q, p)^d, n)-reasoning algorithm is max{p, 2 − 1/q} (a numeric check of the k = 2 case follows this list). [1]
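Under one reading of these bullets (an interpretation of ours, not a quote from the paper), the k = 2 case gives the recursion x ← q(1 − (1 − x)²), whose stable fixed point for q > 1/2 is exactly 2 − 1/q; a few lines of Python check this numerically.

```python
def iterate_oracle(q, p, k=2, rounds=100):
    """Iterate x <- q * (1 - (1 - x)**k), starting from x = p."""
    x = p
    for _ in range(rounds):
        x = q * (1 - (1 - x) ** k)
    return x

q, p = 0.9, 0.3
print(iterate_oracle(q, p))  # ~0.8888...
print(max(p, 2 - 1 / q))     # 0.8888..., matching the stated bound
```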
Object Oriented Programming
University of New Brunsw
Abstract
Object-oriented Programming has become one of the most dominant design paradigms, as the separation of concerns and adaptability of design reduce development and maintenance costs. However, the convenience is not without cost. The added indirection inherent in such designs causes excessive pointer chasing, negatively affecting locality, which in turn degrades the performance of cache structures. Furthermore, modern hardware prefetchers are mostly stride prefetchers that are ill-equipped to handle the unpredictability of access patterns generated by pointer chasing. Most software approaches that seek to address this problem resort to profiling the program as it runs, which comes with a significant run-time overhead, or require data from previous runs. In this paper, we propose the use of compile-time static analysis to predict the most common access patterns displayed by a program during run time. Since Java is one of the most popular object-oriented languages, we implement our prototype within the OpenJ9 JVM, inside the OMR optimizer infrastructure. The outputs of our proposed predictor are Markov chains that model the expected behavior of the program. The effectiveness of the proposed predictor is evaluated by comparing the model with the actual run-time behavior of the program measured using an instrumented interpreter. Our experiments show that the proposed predictor exhibits good accuracy and can be used to inform minimally intrusive load stall mitigation strategies, e.g., informing copying GCs of more locality-friendly copying orders.
AI Summary
  • The predictor is consistently accurate for certain methods while frequently failing to make accurate predictions for others. [3]
  • The size of the resulting model is a better indicator of predictor accuracy than plain method size. [3]
  • The proposed predictor can provide useful information about the runtime object-oriented access pattern of a program at the compilation time of its methods. [3]
  • The models can be used to extract class affinity information about the program at compile time which can then be used by optimization strategies either in the compiler or the GC to improve performance. [3]
  • Affinity Graphs: Graphs that represent the relationships between classes and their objects, used for optimization strategies. [3]
  • Object-oriented access patterns can be predicted at compile time using the OOPredictor, which applies static analysis to model object accesses as Markov chains (pictured in the sketch below). [2]
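The Markov-chain output described in these bullets can be pictured with a toy, trace-based version (illustrative only: the paper's point is that comparable chains are derived statically at compile time, without running the program).

```python
from collections import defaultdict

def build_markov_chain(access_trace):
    """Estimate class-to-class transition probabilities from a sequence of
    object accesses, labeled by class name."""
    counts = defaultdict(lambda: defaultdict(int))
    for src, dst in zip(access_trace, access_trace[1:]):
        counts[src][dst] += 1
    return {
        src: {dst: n / sum(dsts.values()) for dst, n in dsts.items()}
        for src, dsts in counts.items()
    }

trace = ["Order", "Customer", "Address", "Order", "Customer", "Order", "Item"]
print(build_markov_chain(trace)["Order"])  # {'Customer': 0.66..., 'Item': 0.33...}
```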