“It was impressive that Preston led this work as an MCS student with a non-traditional background in computing.”

That’s what Sasa Misailovic, an associate professor in the Siebel School of Computing and Data Science in The Grainger College of Engineering at the University of Illinois Urbana-Champaign, said when asked about Preston Firestone and the acceptance of the paper “UTF-8 Plumbing: Byte-level Tokenizers Unavoidably Enable LLMs to Generate Ill-formed UTF-8” at the upcoming 2025 Conference on Language Modeling (COLM), which takes place from October 7 to 10, 2025, in Montreal, Canada.

Four headshots (L to R): Preston Firestone, Sasa Misailovic, Gagandeep Singh, Shubham Ugare

Firestone is a Master of Computer Science student in the MCS Chicago program. He admits that, “honestly, I felt like I had gotten away with something.”

Misailovic says, “Preston attended my CS 591 seminar on programming and AI safety in Chicago back in Spring 2024. He was looking to get research experience, and we figured out a possible idea where he could help my (then) PhD student, Shubham Ugare.”

Firestone is lead author of the paper, alongside Misailovic, Meta research scientist Ugare (’25 Ph.D. Computer Science) and CS assistant professor Gagandeep Singh.

Firestone says of the research team, “They are allowing me to contribute to science as a whole, or at least to my professional development. My primary goal for the research work, however, was to see whether I’d enjoy it, and to learn what exactly the work consisted of: I’d like to have more information about what it would entail before committing to a doctoral program. And I discovered that I do, in fact, enjoy it!” 

“We have been working on a novel approach for constrained generation as part of our Structured LLM initiative, which controls large language models (LLMs) to generate text that conforms to user-defined rules,” Misailovic explains. “In each step, an LLM generates tokens, which are typically words or several letters (parts of words); however, for non-Latin alphabets and some math notation, tokens can even be just a part of individual symbols. And as Preston was implementing the system we thought of, he encountered an unexpected problem.”

“Preston and Shubham identified that the problem appears when generating tokens for specific math formulas. Then they identified that the same issues also appear when generating human languages written in non-Latin scripts, including those written in Devanagari (for Indian texts), Cyrillic (for Slavic languages), and others.” 

He continues, “Preston then decided to study and develop a new theoretical framework to explain that the current abstractions LLM developers use when processing text with LLMs are not sufficient to shelter us from problems with character encoding. In this work, he connects many threads from machine learning, linguistics, programming languages and theoretical computer science communities. He also studied the existing empirical techniques for fixing the problem and devised a way to fix the issue in our system.”
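The encoding issue Misailovic describes can be illustrated with a minimal Python sketch. It assumes a hypothetical byte-level tokenizer that happens to split one Devanagari letter across two tokens; the token boundary, the helper `is_well_formed_utf8`, and the variable names are illustrative assumptions, not taken from the paper or from the team's system.

```python
import codecs


def is_well_formed_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# One Devanagari letter (U+0928) takes three bytes in UTF-8.
text = "न"
encoded = text.encode("utf-8")

# Hypothetical token boundary that falls in the middle of the character.
tokens = [encoded[:2], encoded[2:]]

# The first token alone is not valid UTF-8; the full sequence is.
prefix_ok = is_well_formed_utf8(tokens[0])
full_ok = is_well_formed_utf8(b"".join(tokens))

# An incremental decoder buffers the incomplete bytes until the
# character is finished, instead of raising an error.
decoder = codecs.getincrementaldecoder("utf-8")()
pieces = [decoder.decode(tok) for tok in tokens]

print(prefix_ok, full_ok, pieces)
```

If generation stopped after the first token, the output would be ill-formed UTF-8. The incremental decoder shown here is only one generic way for downstream code to tolerate partial characters and is not necessarily the fix the authors devised.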

Firestone obtained his undergraduate degree from the University of St Andrews in Scotland, studying philosophy and theology. He says, “At the end of my bachelor’s, I decided to switch topics from philosophy and theology to computer science.”

Learning that the Siebel School of Computing and Data Science in The Grainger College of Engineering at the University of Illinois Urbana-Champaign offers the Illinois Computing Accelerator for Non-Specialists (iCAN) program, he decided to enroll. “My plan always was to complete a master’s, and after iCAN, only the MCS is possible without the special assistance of a sponsoring professor. I chose Chicago because I was already living there.”


Firestone met Misailovic in Chicago. Firestone notes that “the special characteristic of the Chicago MCS program is that the courses are quite small and one thereby has direct and personal access to the professors, while being supported in the background by a large research institution. This combination of liberal-arts-style intimacy with R1-level resources gives excellent opportunities to those who aggressively profit from the access afforded to the professors. Had Sasa not caught me in the office one Thursday after I’d emailed him asking for a recommendation, I wouldn’t be here now. And if it weren’t for the intimate scale of the Chicago program, Sasa might not have known who I was, much less taken a personal interest in me to offer me the position on the project. And had I not made the effort to be in the office to socialize with and encounter professors and students, I wouldn’t have been known to Sasa by face and name.”

Upon hearing of Firestone’s interest in pursuing a PhD, Misailovic connected Firestone with Ugare. “I had always planned to continue in academia after the MCS,” Firestone recalls, “so I began asking professors for recommendations for doctoral programs during my last semester in the MCS. Sasa asked me what was on my resume and, realizing it was insufficient to qualify me for a PhD program, assigned me to Shubham’s SynCode project as a software developer. Luckily, I had already taken CS 421, so I was prepared to wrangle LR parsers.”

Now, Firestone has a research paper under his belt. Misailovic concludes that “What is impressive about Preston’s work is that it started as a side project when solving a practical systems problem with structured LLM tools and evolved into a general statement about many current frameworks for running LLMs. This work brings to attention the need to think systematically about how to improve abstractions when developing new LLM constrained generation frameworks.”

Speaking of the MCS program, Misailovic says, “Our Chicago program is growing, and it is bringing together students of diverse backgrounds, some from different disciplines (like Preston, who studied philosophy in the past) and many with professional/industry experience. I met other ambitious and creative students like Preston, who are willing to step outside of their comfort zone and make something new and exciting. The Chicago MCS program is helping those students discover new opportunities and skills, maybe even those they didn’t know they were capable of.”

Grainger Engineering Affiliations

Sasa Misailovic is an Illinois Grainger Engineering associate professor of computer science.

Gagandeep Singh is an Illinois Grainger Engineering assistant professor of computer science.