Uncovering the Internal Mechanisms of Large Language Models in Decision-Making under Risk
Mr. Yu Qin
Ph.D. Candidate in Information Systems
W.P. Carey School of Business
Arizona State University
Large Language Models (LLMs) have achieved remarkable performance across domains, yet their internal reasoning processes remain largely opaque. Understanding these mechanisms is critical for building responsible and trustworthy AI applications, particularly in high-stakes contexts such as decision-making under risk. Prior efforts to interpret LLMs generally fall into two categories. Behavioral approaches analyze model outputs to infer decision patterns but offer little insight into what happens inside the model. Mechanistic approaches, such as sparse autoencoders (SAEs), examine internal neuron activations directly but lack semantic control, yielding interpretations that are difficult to relate to meaningful concepts in business contexts. This disconnect limits our ability to link an LLM’s internal mechanisms with practically relevant decision behavior, thereby constraining its responsible use in business settings. To bridge this gap, we propose the Disentangled Cognitive Sparse Autoencoder (DC-SAE), a theory-guided framework that integrates supervised cognitive constraints into an unsupervised architecture. DC-SAE controllably maps an LLM’s internal neuron activations to human-aligned features through two latent subspaces: one guided by Prospect Theory and another that discovers emergent, non-theorized behaviors. The two subspaces are jointly optimized to learn disentangled internal mechanisms that explain the LLM’s decision-making under risk. We evaluate DC-SAE on a 17,280-item risky-choice dataset grounded in Prospect Theory, where it achieves 10% higher predictive accuracy and 15% higher AUC than the best-performing benchmark while maintaining superior reconstruction fidelity of internal neuron activations. These results demonstrate that the learned subspaces reliably capture the mechanisms underlying the LLM’s decision behavior rather than reflecting arbitrary latent patterns. Qualitative analysis confirms that LLMs exhibit a human-like cognitive structure consistent with Prospect Theory, yet they also display distinct decision tendencies, such as attenuated loss aversion and a dual-process mechanism in which intuitive and analytical features operate in parallel. To demonstrate real-world utility, we extend DC-SAE to a bank-loan approval dataset, showing that the learned cognitive interpretations enable targeted behavioral interventions that adjust the model’s approval tendencies through interpretable factors. Our work establishes a controllable framework for mechanistic interpretation, offering a new path toward understanding and influencing LLM behavior. It contributes to the information systems literature by advancing the interpretability of LLMs and to practice by enabling responsible and controllable applications of LLMs in decision-making under risk.
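For concreteness, the following minimal Python sketch illustrates one way the two-subspace design could be realized; it is an illustration under stated assumptions rather than our implementation. The class and function names (DCSAESketch, theory_probe, dc_sae_loss), the ReLU sparse encoder, the linear probe used as the supervised cognitive constraint, and all dimensions and loss weights are hypothetical.

```python
# Illustrative sketch only: module names, dimensions, and loss weights are assumptions,
# not the paper's published implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DCSAESketch(nn.Module):
    """A minimal disentangled sparse autoencoder with two latent subspaces:
    a theory-guided subspace (supervised toward Prospect Theory constructs,
    e.g., loss aversion) and an emergent subspace for non-theorized behaviors."""

    def __init__(self, d_act: int, d_theory: int, d_emergent: int):
        super().__init__()
        self.encoder = nn.Linear(d_act, d_theory + d_emergent)
        self.decoder = nn.Linear(d_theory + d_emergent, d_act)
        self.d_theory = d_theory
        # Hypothetical probe mapping theory-guided features to behavior labels
        # derived from Prospect Theory (the supervised cognitive constraint).
        self.theory_probe = nn.Linear(d_theory, 1)

    def forward(self, h):
        z = F.relu(self.encoder(h))                    # sparse, non-negative codes
        z_theory = z[:, :self.d_theory]                # theory-guided subspace
        z_emergent = z[:, self.d_theory:]              # emergent subspace
        h_hat = self.decoder(z)                        # reconstruct LLM activations
        y_hat = self.theory_probe(z_theory)            # predict theory-labeled choice
        return h_hat, y_hat, z_theory, z_emergent

def dc_sae_loss(h, y, model, l1_weight=1e-3, sup_weight=1.0):
    """Assumed joint objective: reconstruction fidelity of neuron activations
    + L1 sparsity on both subspaces + supervised alignment of the theory subspace."""
    h_hat, y_hat, z_theory, z_emergent = model(h)
    recon = F.mse_loss(h_hat, h)
    sparsity = z_theory.abs().mean() + z_emergent.abs().mean()
    supervised = F.binary_cross_entropy_with_logits(y_hat.squeeze(-1), y)
    return recon + l1_weight * sparsity + sup_weight * supervised

if __name__ == "__main__":
    # Toy demo with random tensors standing in for LLM activations and choice labels.
    model = DCSAESketch(d_act=512, d_theory=16, d_emergent=112)
    h = torch.randn(8, 512)                 # hidden activations for 8 risky-choice prompts
    y = torch.randint(0, 2, (8,)).float()   # e.g., 1 = chose the risky gamble
    print(dc_sae_loss(h, y, model).item())
```

The design choice this sketch highlights is that reconstruction fidelity, sparsity, and supervised alignment of the theory-guided subspace are optimized jointly, so theory-aligned and emergent features are disentangled within a single encoder.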