Obedience Theatre: Do Rule-Heavy System Prompts Produce Real Policy Compliance or Just Better Acting?
When an assistant is given detailed internal rules, does it genuinely follow policy better, or does it simply learn to sound compliant? This paper examines the gap between outward signals of compliance and actual policy adherence in large language models operating under rule-heavy system prompts. Drawing on a series of structured prompt experiments, we distinguish surface-level compliance theatre from genuine internalisation of policy in the model's behaviour.