Abstract
In this talk, I will share our team’s journey in developing the SEA-LION LLMs, a family of open large language models designed to better understand Southeast Asian languages, cultural nuances, and context. I will discuss our motivations, key design decisions, and the technical and non-technical challenges we faced during development. I will also highlight how our commitment to regional community-building and our creation of a comprehensive evaluation suite for Southeast Asia have set us apart in a competitive landscape, enabling strategic co-development partnerships with regional leaders like Gojek and global AI companies such as NVIDIA and DeepMind. Finally, I will outline our vision for building on the progress of the past two years to shape the next chapter of SEA-LION.
Bio
William Tjhi has nearly two decades of experience in applying machine learning to industry problems. He earned his PhD from NTU in 2008, focusing on unsupervised learning for text data. His career includes time at A*STAR, where he scaled up ML with distributed systems, and at GovTech, where he contributed to early data science efforts. As NLP lead at Traveloka, he tackled the challenges of NLP for low-resource Bahasa Indonesia, an experience that inspired him to initiate a program for building NLP resources for Southeast Asian languages at AI Singapore. William was a foundational engineer for AI Singapore's 100 Experiments and AI Apprenticeship Program. Currently, he leads applied research on Regional LLMs in AI Singapore's AI Products division, serves as a technical AI advisor to MDDI Translation Technology, and actively participates in regional AI communities like Data Science Indonesia, Data Science SG, and Cambodia's AI Forum.