CppCon 2018 has ended
Wednesday, September 26 • 09:00 - 10:00
Fast Conversion From UTF-8 with C++, DFAs, and SSE Intrinsics

UTF-8 is taking on an increasingly important role in text processing. Many applications require the conversion of UTF-8 to UTF-16 or UTF-32, but typical conversion algorithms are sub-optimal. This talk will describe a fast, correct, DFA-based approach to UTF-8 conversion that requires only three simple lookup tables and a small amount of straightforward C++ code.

We'll begin with a quick review of UTF-8 and its relation to UTF-16 and UTF-32, as well as the concepts of code units and code points. Next, we'll look at the layout of bits within a UTF-8 byte sequence and, from that, show a simple algorithm for converting from UTF-8 to UTF-32. Along the way, we'll define overlong and invalid byte sequences. We'll then discuss how to construct a DFA that performs the same operations as the simple algorithm. Finally, we'll look at code for the DFA traversal underlying the basic conversion algorithm, and at how to gain an additional performance boost by using SSE intrinsics.
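
To make the "simple algorithm" concrete, here is a minimal sketch of a straightforward UTF-8 to UTF-32 decoder. It is not the code from the talk: the names decode_one and utf8_to_utf32 are hypothetical, and it uses plain branching rather than the DFA and lookup tables the talk describes. It rejects overlong encodings (a code point encoded in more bytes than necessary), surrogate code points, values above U+10FFFF, and malformed or truncated sequences.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical helper (not from the talk): decodes one code point starting
// at s[i], where i < s.size(). On success, stores the code point in cp,
// advances i past the sequence, and returns true.
inline bool decode_one(std::string_view s, std::size_t& i, char32_t& cp)
{
    const auto b0 = static_cast<std::uint8_t>(s[i]);
    std::size_t len = 0;     // total bytes in this sequence
    char32_t min_cp = 0;     // smallest code point allowed for this length

    // The lead byte determines the sequence length and carries the top bits.
    if (b0 < 0x80)                { cp = b0;        len = 1; min_cp = 0x0;     }
    else if ((b0 & 0xE0) == 0xC0) { cp = b0 & 0x1F; len = 2; min_cp = 0x80;    }
    else if ((b0 & 0xF0) == 0xE0) { cp = b0 & 0x0F; len = 3; min_cp = 0x800;   }
    else if ((b0 & 0xF8) == 0xF0) { cp = b0 & 0x07; len = 4; min_cp = 0x10000; }
    else return false;  // stray continuation byte, or 0xF8..0xFF

    if (s.size() - i < len) return false;  // truncated sequence

    // Each continuation byte has the form 10xxxxxx and adds six payload bits.
    for (std::size_t k = 1; k < len; ++k) {
        const auto b = static_cast<std::uint8_t>(s[i + k]);
        if ((b & 0xC0) != 0x80) return false;
        cp = (cp << 6) | (b & 0x3F);
    }

    if (cp < min_cp) return false;                   // overlong encoding
    if (cp > 0x10FFFF) return false;                 // beyond Unicode range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // UTF-16 surrogate

    i += len;
    return true;
}

// Converts the whole input; returns false at the first invalid sequence.
inline bool utf8_to_utf32(std::string_view in, std::u32string& out)
{
    out.clear();
    std::size_t i = 0;
    while (i < in.size()) {
        char32_t cp = 0;
        if (!decode_one(in, i, cp)) return false;
        out.push_back(cp);
    }
    return true;
}
```

For example, the three bytes E2 82 AC decode to U+20AC, while the two bytes C0 AF are rejected as an overlong encoding of U+002F. A table-driven DFA would replace the chain of branches with state transitions looked up per byte.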

To close, we'll compare the performance of this approach to several commonly available implementations on Windows and Linux, and show how it's possible to do significantly faster conversions.

Bob Steagall

CppCon Poster Chair, KEWB Computing
I've been working in C++ since discovering the second edition of The C++ Programming Language in a college bookstore in 1992. The majority of my career has been spent in medical imaging, where I led teams building applications for functional MRI and CT-based cardiac visualization...

Wednesday September 26, 2018 09:00 - 10:00 PDT
Telluride (407)
  • Data Structures and Algorithms