Strings, encodings, NULs and Bazel
A story about how strings without NUL terminators are problematic for interop with the OS
Just yesterday, Twitter user @vkrajacic wrote:
Advice for new C programmers: “Avoid null-terminated strings; they’re outdated, inefficient and impractical.”
Create your own type with basic functions. It’s not that hard, and it goes a long way. One of the benefits of this approach, among others, is slicing without copying.
This suggestion has its merits and I understand where it is coming from: performance. You see: the traditional way to represent strings in C is to use NUL-terminated byte arrays. Yet… this has been deemed the most expensive one-byte mistake because of the adverse performance implications it carries. (NUL, not NULL, is the better name for the \0 byte, by the way.)
It is of course possible to do differently. Pascal, for example, used a 1-byte prefix to indicate how long strings are. This representation takes the same space as a NUL terminator and fixes the performance problems, but it carries the downside of limiting strings to 255 bytes in length. And you can do as the original author said: augment your strings with a (larger) size field so that you do not suffer from this limitation.
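For illustration, here is a minimal sketch of such a sized-string type (the names are my own invention, not anything from the tweet): slicing becomes plain pointer arithmetic, and no NUL terminator is involved anywhere.

```cpp
#include <cstddef>
#include <cstdio>

// Hypothetical sized-string type along the lines of the tweet's advice:
// a pointer plus an explicit length, with no NUL terminator in sight.
struct Str {
    const char* data;
    std::size_t len;
};

// Slicing is just pointer arithmetic: no allocation, no copy. The slice
// borrows the original buffer, so the caller must keep it alive.
Str slice(Str s, std::size_t start, std::size_t end) {
    return Str{s.data + start, end - start};
}

int main() {
    Str path{"usr/share/doc", 13};
    Str dir = slice(path, 0, 3);  // "usr"; shares storage with path
    std::printf("%.*s\n", static_cast<int>(dir.len), dir.data);
}
```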
So, yes, it is possible to use strings without NUL terminators. Unfortunately, you are in for pain if you do so for one simple reason: interop. Your code does not run in a vacuum: it runs in the context of an existing operating system, and it probably uses one or more libraries. Almost all operating systems to date, if not all, expose a C API (libc and system calls), which means that they expect strings to be NUL-terminated. And the same is true for all interesting libraries out there.
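To see the pain concretely, consider what happens when a length-carrying slice has to cross into a C API. Here is a sketch using POSIX open(2) and C++'s std::string_view, which is exactly this kind of pointer-plus-length type:

```cpp
#include <fcntl.h>    // open(2); assumes a POSIX system
#include <string>
#include <string_view>

// A string_view carries a length but makes no promise that
// data()[size()] is a NUL byte, e.g. when it is a slice of a larger path.
int open_slice(std::string_view path) {
    // open(path.data(), O_RDONLY) would be a bug: the C side would keep
    // reading past the end of the slice until it happens to hit a 0.
    std::string copy(path);               // forced copy, just to...
    return open(copy.c_str(), O_RDONLY);  // ...obtain a NUL terminator
}
```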
And for this reason, I want to tell you a little story about how I ran into this painful problem while working on Bazel a few years ago.
Bazel is primarily a Java program. Bazel also has a bunch of C and C++ code integrated via JNI, which is used to reach certain system calls and libraries that are only available in C. For historical reasons, Java uses UTF-16 to represent strings but… most operating systems out there do not. As a consequence, every JNI call has to start with a conversion from UTF-16 into whatever the operating system may accept, which typically is Latin-1 or UTF-8. This is a costly, unavoidable step.
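As a sketch of what such a shim looks like (the Java class and method names here are made up), this is the conversion that JNI's GetStringUTFChars performs on every call:

```cpp
#include <fcntl.h>
#include <jni.h>

// Hypothetical JNI shim. Every call pays for a UTF-16 -> modified UTF-8
// conversion up front, before the C world can see the string at all.
extern "C" JNIEXPORT jint JNICALL
Java_com_example_Files_openFile(JNIEnv* env, jclass, jstring path) {
    // GetStringUTFChars allocates a new buffer and converts into it:
    // this is the costly, unavoidable step.
    const char* cpath = env->GetStringUTFChars(path, nullptr);
    if (cpath == nullptr) {
        return -1;  // OutOfMemoryError was already thrown by the JVM
    }
    jint fd = open(cpath, O_RDONLY);
    env->ReleaseStringUTFChars(path, cpath);
    return fd;
}
```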
Fortunately, Java 9 introduced a new feature called compact strings. When this feature is enabled, the JVM represents strings as Latin-1 wherever possible: as long as the string’s characters can be represented with this encoding, UTF-16 doesn’t enter the picture. This is great because most strings that Bazel handles are file paths—it handles a lot of paths—and these can typically be represented in Latin-1. With this feature, we could modify the JNI shims to reuse the in-memory bytes as is, without a costly conversion in the common case.
But there is an unfortunate twist. The Latin-1 compact strings that the JVM creates are not NUL-terminated. This means that, even if the string bytes that Java hands to C are exactly as we need them in memory to call some other API, they are not directly usable. As a result, the JNI code is forced to make a copy of the string just so that it can pad it with a NUL terminator.
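Concretely, the JNI layer needs something like the following hypothetical helper, where the whole allocation and memcpy exist only to append a single zero byte:

```cpp
#include <cstddef>
#include <cstring>
#include <memory>

// Hypothetical helper: 'bytes' points at 'n' Latin-1 bytes with no NUL
// terminator, which is how the JVM's compact strings keep them in memory.
std::unique_ptr<char[]> with_nul(const char* bytes, std::size_t n) {
    auto buf = std::make_unique<char[]>(n + 1);  // one extra byte...
    std::memcpy(buf.get(), bytes, n);
    buf[n] = '\0';                               // ...just for this
    return buf;
}
```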
Which is… sad and wasteful. As I mentioned, Bazel handles a ton of paths. Most of the memory bloat in the Bazel server process comes from the need to track file paths and command line arguments, and when you have many of these strings amounting to GBs of RAM, you can imagine that processing and copying them is costly too. I think I did measure a not-insignificant runtime penalty from these unnecessary copies back in the day, but I forget the details now.
So, be careful: it’s entirely reasonable to annotate string representations with their size, and you should do so where possible because of the performance gains that come with it. But when you do this, don’t forget to pad the strings with a NUL character for the cases where you need interop. You don’t want to be making unnecessary string copies just because of this, and you don’t know when you’ll need the interop.
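A minimal sketch of what such a representation could look like (the names are mine and error handling is omitted): the explicit length gives O(1) size queries, and the extra byte past the end keeps the buffer directly usable with C APIs.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>

// Hypothetical owning string: explicit length plus a trailing NUL byte,
// so that 'data' can be handed to C interfaces without a copy.
struct OwnedStr {
    char* data;
    std::size_t len;
};

OwnedStr make_str(const char* src, std::size_t n) {
    OwnedStr s;
    s.data = static_cast<char*>(std::malloc(n + 1));  // room for the NUL
    std::memcpy(s.data, src, n);
    s.data[n] = '\0';  // interop-ready: open(s.data, ...) needs no copy
    s.len = n;
    return s;
}
```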
C++'s std::string, for example, uses a combination of a NUL terminator and a length field, which allows it to be efficient for manipulation while also allowing the "raw bytes" to be passed to C interfaces. I assumed Go and Rust did the same (they don't, as I've been told). Strings are hard, though, and some languages handle them better than others.
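Here is what that duality buys you in practice. Since C++11, std::string's buffer is guaranteed to be contiguous and NUL-terminated, so both pieces of information come for free:

```cpp
#include <cstddef>
#include <fcntl.h>
#include <string>

// std::string keeps both the length and a trailing NUL around:
// c_str() and data() point at the same NUL-terminated bytes.
int open_path(const std::string& path) {
    std::size_t n = path.size();          // O(1); no strlen() walk
    (void)n;
    return open(path.c_str(), O_RDONLY);  // no copy needed for interop
}
```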
There is an extra dimension to path string pain: the interoperability of path strings across machine types. The difference in standard UTF-8 encoding between macOS and everyone else raises all sorts of "what is correct?" issues. It gets even more fun with Bazel, where the host might be macOS but the execution environment might be Linux and the target OS might be Windows. https://github.com/aiuto/bazel_samples/tree/main/utf8
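To make the issue concrete (my own illustration, not from the linked repository): Unicode allows the "same" file name to be spelled as two different UTF-8 byte sequences, and macOS file systems have historically preferred the decomposed form while other systems preserve whatever bytes they are given.

```cpp
#include <cstdio>
#include <cstring>

int main() {
    // Two valid UTF-8 spellings of the file name "café":
    const char* nfc = "caf\xc3\xa9";   // precomposed 'é' (U+00E9)
    const char* nfd = "cafe\xcc\x81";  // 'e' plus combining acute (U+0301)

    // The byte sequences differ, so naive byte-wise comparisons disagree
    // about whether the two names refer to the same file.
    std::printf("same bytes? %d\n", std::strcmp(nfc, nfd) == 0);  // prints 0
}
```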