Attending this event?
View analytic
Wednesday, September 26 • 09:00 - 10:00
Fast Conversion From UTF-8 with C++, DFAs, and SSE Intrinsics

Sign up or log in to save this to your schedule and see who's attending!

UTF-8 is taking on an increasingly important role in text processing. Many applications require the conversion of UTF-8 to UTF-16 or UTF-32, but typical conversion algorithms are sub-optimal. This talk will describe a fast, correct, DFA-based approach to UTF-8 conversion that requires only three simple lookup tables and a small amount of straightforward C++ code.

We'll begin with a quick review UTF-8 and its relation to UTF-16 and UTF-32, as well as the concept of code units and code points. Next, we'll look at the layout of bits within a UTF-8 byte sequence, and from that, show a simple algorithm for converting from UTF-8 to UTF-32. Along the way will be a definition of overlong and invalid byte sequences. Following that will be a discussion of how to construct a DFA to perform the same operations as the simple algorithm. We'll then look at code for the DFA traversal underlying the basic conversion algorithm, and how to gain an additional performance boost by using SSE intrinsics.

Finally, we'll compare the performance of this approach to several commonly-available implementations on Windows and Linux, and show how it's possible to do significantly faster conversions.

avatar for Bob Steagall

Bob Steagall

CppCon Poster Chair, KEWB Computing
I've been working in C++ for the last 25 years. The majority of my career has been spent in medical imaging, where I led teams building applications for functional MRI and CT-based cardiac visualization. After a brief detour through the world of DNS and analytics, I'm now working... Read More →

Wednesday September 26, 2018 09:00 - 10:00

Attendees (22)