Abstract
Benchmark accuracy has become a standard way to measure progress in language models, yet it often fails to capture whether a model is reasoning robustly or relying on superficial shortcuts. This talk presents the core idea behind the SCARE project, a perturbation-based evaluation framework designed to probe reasoning stability in arithmetic word problems. The central approach is simple: apply controlled modifications that preserve the underlying answer, then examine whether the model’s prediction remains consistent. Through perturbation families such as context padding, math-safe lexical substitution, symbolic numeric re-encoding, and premise reordering, SCARE reveals forms of brittleness that are invisible to ordinary accuracy-based evaluation. I will discuss the design principles of the framework, its role as a diagnostic tool for reasoning robustness, and broader implications for evaluating and improving trustworthy AI systems.
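To make the consistency criterion concrete, the sketch below illustrates the answer-preserving perturbation check in Python. The `Solver` interface, the `pad_context` perturbation, and all other names are illustrative assumptions for this abstract, not the actual SCARE implementation.

```python
from typing import Callable, Iterable, Protocol


class Solver(Protocol):
    # Hypothetical interface: anything that maps a word problem to an answer string.
    def predict(self, problem: str) -> str: ...


def pad_context(problem: str) -> str:
    """Context padding: prepend irrelevant but harmless text; the answer is unchanged."""
    return "Earlier that morning it had rained. " + problem


def is_consistent(
    solver: Solver,
    problem: str,
    gold_answer: str,
    perturbations: Iterable[Callable[[str], str]],
) -> bool:
    """Return True only if the prediction survives every answer-preserving perturbation."""
    if solver.predict(problem) != gold_answer:
        return False  # already wrong on the unperturbed problem
    return all(
        solver.predict(perturb(problem)) == gold_answer
        for perturb in perturbations
    )
```

Under this framing, a model that answers the original problem correctly but flips its prediction after, say, premise reordering is counted as brittle even though plain accuracy would record a success.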