Abstract
As LLMs are customized, maintaining the balance between helpfulness and safety is vital. This talk introduces a modular paradigm for alignment using weight arithmetic. I will first present the Preference Vector framework, which enables test-time control over multi-preference alignment by merging behavior-specific vectors without retraining. I will then demonstrate how merging pre- and post-fine-tuned weights effectively restores safety guardrails lost during downstream adaptation, reducing attack success rates while improving performance. Together, these methods offer a scalable, data-efficient approach to safeguarding customized AI systems through efficient parameter-space operations.
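The parameter-space operations mentioned above can be illustrated with a toy sketch. This is not the talk's actual implementation; the dict-of-lists "models", the `preference_vector` and `merge` helpers, and the coefficient values are all hypothetical, but the core idea matches the abstract: a behavior-specific vector is the difference between fine-tuned and base weights, and merging adds scaled vectors back to the base model without retraining.

```python
# Toy sketch of preference-vector merging (hypothetical helpers and weights).
# A "model" here is just a dict mapping parameter names to float lists.

def preference_vector(tuned, base):
    """Behavior vector = fine-tuned weights minus base weights."""
    return {k: [t - b for t, b in zip(tuned[k], base[k])] for k in base}

def merge(base, vectors, coeffs):
    """Add scaled preference vectors to the base model at test time."""
    merged = {k: list(v) for k, v in base.items()}
    for vec, c in zip(vectors, coeffs):
        for k in merged:
            merged[k] = [w + c * d for w, d in zip(merged[k], vec[k])]
    return merged

base = {"layer": [0.0, 1.0]}
helpful = {"layer": [0.5, 1.5]}  # weights fine-tuned for helpfulness
safe = {"layer": [-0.2, 1.0]}    # weights fine-tuned for safety

vecs = [preference_vector(helpful, base), preference_vector(safe, base)]
model = merge(base, vecs, coeffs=[1.0, 0.5])
print([round(w, 6) for w in model["layer"]])  # → [0.4, 1.5]
```

Because the coefficients are applied at merge time, the helpfulness/safety trade-off can be re-tuned by re-running `merge` with different weights, with no gradient updates involved.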
Bio
Shang-Tse Chen is an Associate Professor in the Department of Computer Science and Information Engineering at National Taiwan University. He works at the intersection of applied and theoretical machine learning, with a strong application focus on cybersecurity. His research has led to patented cyber threat detection technology with Symantec, open-sourced adversarial attack and defense tools with Intel, and a deployed fire risk prediction system with the Atlanta Fire Rescue Department. He received the K. T. Li Young Researcher Award in 2025. His recent research interests center on the security, privacy, and fairness of machine learning models.