Abstract
Large Language Models (LLMs) such as GPT-4 have demonstrated remarkable capabilities through pre-training on vast text corpora. However, their raw outputs often fall short of the nuanced expectations of human users. In this talk, I will explore the critical role of post-training techniques in aligning LLM behavior with human preferences. The talk begins with an overview of LLM pre-training via next-token prediction and then examines why this objective alone is insufficient. Drawing on insights from reinforcement learning, it turns to post-training strategies, including supervised fine-tuning and preference optimization methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). I will also present our recent findings on the surprising effectiveness of on-policy, suboptimal data in preference learning, and discuss the implications for future LLM alignment research. The talk is intended for audiences interested in LLMs, alignment, and post-training.