My Journey on Trustworthy Reinforcement Learning: From Online Decision Making to Large Language Models
Prof. Will Wei Sun
Associate Professor of Quantitative Methods and Statistics
Daniels School of Business
Purdue University
Reinforcement learning plays a critical role in decision-making systems that operate in dynamic and uncertain environments. In this talk, I will share my research journey in developing principled methods to ensure statistical reliability in reinforcement learning, spanning from online decision-making to large language models (LLMs). First, I will introduce a low-rank contextual bandit framework, where the model parameter exhibits a low-rank structure. While existing bandit and reinforcement learning algorithms primarily focus on reward maximization, statistical inference remains underexplored due to the challenges posed by adaptive data collection and model complexity. To address these issues, we propose an online doubly-debiased inference procedure that corrects biases arising from both non-convexity and data adaptivity, enabling valid uncertainty quantification in sequential decision-making. Next, I will discuss reinforcement learning from human feedback (RLHF) in LLMs, where aligning AI-generated responses with human preferences is complicated by the heterogeneity of feedback sources. To tackle this challenge, we introduce a dual active learning framework that strategically selects both informative conversations and optimal human annotators using a D-optimal design. This approach enhances reward learning by minimizing generalized variance, ultimately improving the efficiency and reliability of aligning LLMs with human values.
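For readers unfamiliar with D-optimal design, the selection idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the speaker's algorithm. It assumes a simple linear model in which each candidate (for example, a conversation paired with an annotator) is summarized by a hypothetical feature vector, and it greedily picks the candidates that most increase the log-determinant of the information matrix, which is equivalent to shrinking the generalized variance of the estimate. The function name `greedy_d_optimal`, the `ridge` regularizer, and the toy feature vectors are all illustrative assumptions.

```python
import math


def greedy_d_optimal(candidates, budget, ridge=1e-3):
    """Greedy D-optimal selection under an assumed linear model (illustrative only).

    candidates: list of equal-length feature vectors (hypothetical encodings of
                conversation/annotator pairs).
    budget:     number of candidates to select.
    ridge:      small positive constant so the starting information matrix
                ridge * I is invertible.
    Returns the selected indices and the log-determinant of the final
    information matrix ridge * I + sum of x x^T over selected x.
    """
    d = len(candidates[0])
    # Inverse of the starting information matrix ridge * I.
    inv = [[(1.0 / ridge if i == j else 0.0) for j in range(d)] for i in range(d)]
    log_det = d * math.log(ridge)
    remaining = list(range(len(candidates)))
    chosen = []

    for _ in range(min(budget, len(candidates))):
        best, best_gain, best_u = None, -1.0, None
        for idx in remaining:
            x = candidates[idx]
            # u = inv @ x
            u = [sum(inv[i][j] * x[j] for j in range(d)) for i in range(d)]
            # Matrix determinant lemma: det(A + x x^T) = det(A) * (1 + x^T A^{-1} x),
            # so the candidate with the largest x^T A^{-1} x gives the largest det.
            gain = sum(x[i] * u[i] for i in range(d))
            if gain > best_gain:
                best, best_gain, best_u = idx, gain, u

        chosen.append(best)
        remaining.remove(best)
        log_det += math.log1p(best_gain)

        # Sherman-Morrison rank-one update of the inverse.
        scale = 1.0 + best_gain
        for i in range(d):
            for j in range(d):
                inv[i][j] -= best_u[i] * best_u[j] / scale

    return chosen, log_det


# Toy usage: two nearly collinear candidates and one in a different direction.
toy_candidates = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
selected, final_log_det = greedy_d_optimal(toy_candidates, budget=2)
print(selected, final_log_det)
```

In this toy setting the greedy rule tends to skip a candidate that is nearly collinear with one already chosen, since it adds little new information. The preference-based reward models used in RLHF would replace the plain outer product with a model-specific information term.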
Will Wei Sun is an Associate Professor of Management at the Daniels School of Business, Purdue University, and is also affiliated with the Department of Statistics. Previously, he worked as a research scientist on the advertising science team at Yahoo Labs. Dr. Sun’s research centers on trustworthy reinforcement learning, statistical foundations of large language models, and online decision-making in two-sided markets. His research has been partially supported by grants from the National Science Foundation, the Office of Naval Research, and the Ross-Lynn Research Scholar Fund.