Published 03/05/2023
Optimizing a JSON parser in Zig
By Asher White
In the fast-changing computer world, 50 years seems like an eternity. And yet, the C programming language, developed in the early 1970s, is still the standard language for low-level, close-to-the-hardware computing. The vast majority of new languages are far easier to code in than C, but that ease of use often comes at the cost of some flexibility or performance. In the last decade, Rust has made huge advances in capability and popularity and is often considered the successor to C++. But what about the step closer to the hardware? What language should be used when you need direct, fine-grained control over memory allocations and extremely low overhead? Until recently, C was the only language widely used for that. In the past few years, however, several new languages designed to replace C have been released, such as Zig, Odin, V, Beef and more.
Of these, Zig caught my interest and I decided to learn more. Part of this is because of how easy it is to cross-compile with Zig—even code that isn't Zig. C, C++ and even Rust can be compiled and linked for virtually any platform using just a single Zig executable, without needing separate toolchains or sysroots. Another reason Zig interested me is its extremely high performance, and the fact that it is already being used in projects like Bun with impressive results. So, I decided to learn Zig, and to hone my skills I wrote and optimized a JSON parser. There is already a JSON parser built into the standard library, but I started with the goal of just learning how to implement a fast program in Zig.
JSON, loosely based on JavaScript syntax, is a text-based data-interchange format that's heavily used in web development and many other fields. It has a simple syntax that's very well explained on json.org. There are 6 data types: Object, Array, String, Number, Boolean (represented by the atoms true or false) and null, so these can all be held in one tagged union, JsonValue.
(I’ll show some excerpts in this post, but visit the Github repository for the full code)
pub const JsonValue = union(enum) {
    Object: *JsonObject,
    Array: *[]JsonValue,
    String: *[]u8,
    Number: f64,
    Boolean: bool,
    Null,
};
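To give a feel for how a tagged union like this gets consumed, here is a hypothetical usage sketch (printValue is my name, not part of zig-json) that assumes the JsonValue definition above is in scope and switches on whichever tag is active:

const std = @import("std");

// Sketch only: print whatever kind of value the union currently holds.
fn printValue(value: JsonValue) void {
    switch (value) {
        .Number => |n| std.debug.print("{d}\n", .{n}),
        .Boolean => |b| std.debug.print("{}\n", .{b}),
        .Null => std.debug.print("null\n", .{}),
        .String => |s| std.debug.print("\"{s}\"\n", .{s.*}),
        .Array => |items| std.debug.print("array with {d} elements\n", .{items.len}),
        .Object => std.debug.print("object\n", .{}),
    }
}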
There are only two kinds of data structures in JSON: objects or arrays. Each of these can hold different values, so there are three states the parser can be in: looking for the next value, inside an object or inside an array.
const ParseState = enum { Value, Object, Array };
When the parser is in .Value mode, it will just look for and return the next whole value (i.e., if the next value is an object, it will recurse until the whole object can be returned). When the parser is in .Object mode, it will look for key-value pairs separated by commas, and when it is in .Array mode it will look for several values separated by commas.
So, I started by implementing the .Value mode in the parseNext function. It takes the next character of the JSON string (or the next 4 or 5 characters if it's checking for the atoms true, false or null), and runs a series of tests to see how to parse it. If it is a value type (String, Number, Boolean or Null), it just reads it and returns it, but if it is a structure type (Array or Object), it calls parseNext again with the right mode.
switch (state) {
    .Value => {
        if (char == '"') {
            var string = try allocator.create([]u8);
            errdefer allocator.destroy(string);
            string.* = try readString(json, allocator);
            return .{ .String = string };
        }
        if ((char >= '0' and char <= '9') or char == '-') {
            return .{ .Number = try readNumber(json) };
        }
        if (char == '{') {
            json.ignoreNext();
            return try parseNext(json, .Object, allocator);
        }
        if (char == '[') {
            json.ignoreNext();
            return try parseNext(json, .Array, allocator);
        }
        if (json.peekMany(4)) |next_4| {
            if (mem.eql(u8, &next_4, "null")) {
                _ = json.take(4) orelse unreachable;
                return JsonValue.Null;
            }
            if (mem.eql(u8, &next_4, "true")) {
                _ = json.take(4) orelse unreachable;
                return .{ .Boolean = true };
            }
            if (json.peekMany(5)) |next_5| {
                if (mem.eql(u8, &next_5, "false")) {
                    _ = json.take(5) orelse unreachable;
                    return .{ .Boolean = false };
                }
            }
        }
        return ParseError.ExpectedNextValue;
    },
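    // The .Object and .Array prongs, which read key-value pairs and
    // comma-separated values, are omitted from this excerpt.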
}
To efficiently work through the json buffer (it's just a slice of u8; there is no string type in Zig), I wrote a SliceIterator that takes a slice and lets you increment through it either with copies (the peekCopy and peekMany methods), or by incrementing the pointer (the next, take, ignoreNext and ignoreMany methods).
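The real iterator lives in the repository, but a minimal sketch of the idea (index-based here rather than pointer-based, and simplified from the actual implementation) could look like this:

pub fn SliceIterator(comptime T: type) type {
    return struct {
        slice: []const T,
        index: usize = 0,

        const Self = @This();

        /// Copy the next item without advancing.
        pub fn peekCopy(self: *Self) ?T {
            if (self.index >= self.slice.len) return null;
            return self.slice[self.index];
        }

        /// Copy the next `n` items without advancing.
        pub fn peekMany(self: *Self, comptime n: usize) ?[n]T {
            if (self.index + n > self.slice.len) return null;
            return self.slice[self.index..][0..n].*;
        }

        /// Return the next item and advance past it.
        pub fn next(self: *Self) ?T {
            const item = self.peekCopy() orelse return null;
            self.index += 1;
            return item;
        }

        /// Return the next `n` items and advance past them.
        pub fn take(self: *Self, comptime n: usize) ?[n]T {
            const items = self.peekMany(n) orelse return null;
            self.index += n;
            return items;
        }

        /// Skip the next item.
        pub fn ignoreNext(self: *Self) void {
            if (self.index < self.slice.len) self.index += 1;
        }

        /// Skip the next `n` items (or whatever is left).
        pub fn ignoreMany(self: *Self, n: usize) void {
            const new_index = self.index + n;
            self.index = if (new_index > self.slice.len) self.slice.len else new_index;
        }
    };
}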
To actually parse the values, the Zig standard library was very useful. Because f64 is the de-facto standard for JSON numbers, the std.fmt.parseFloat function was useful for properly parsing floating point numbers, taking into account edge cases where the number starts to lose precision. Also, because JSON strings can include escaped UTF-16 code units but the files themselves are UTF-8-encoded, the std.unicode.utf16leToUtf8 function helped to properly handle \uXXXX escape sequences in JSON strings.
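For example, a single \uXXXX escape can be decoded along these lines. This is a simplified sketch using my own names (decodeUnicodeEscape), not zig-json's code, and it ignores surrogate pairs, which need two consecutive escapes:

const std = @import("std");

// Turn the four hex digits of a "\uXXXX" escape into UTF-8 bytes written to `out`.
// Returns how many bytes were written.
fn decodeUnicodeEscape(hex4: []const u8, out: []u8) !usize {
    const code_unit = try std.fmt.parseInt(u16, hex4, 16);
    const utf16 = [_]u16{code_unit};
    return std.unicode.utf16leToUtf8(out, &utf16);
}

test "decode a \\u00e9 escape" {
    var buf: [4]u8 = undefined;
    const len = try decodeUnicodeEscape("00e9", &buf);
    // "\xc3\xa9" is the UTF-8 encoding of é (U+00E9).
    try std.testing.expectEqualStrings("\xc3\xa9", buf[0..len]);
}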
The first working implementation I had was brutally slow. It took 13 seconds to parse a large file (around 17 MB), using over half a gigabyte of memory. That's the downside of working in a very low-level language: the most obvious way to do something might be extremely slow, but sometimes there's a quick fix. I inspected the program with the Instruments app that comes with Xcode, and I saw that I was making over 700,000 memory allocations. That's a crazy lot for a 17 MB file. I was using the GeneralPurposeAllocator that's implemented in Zig, so I could just switch to the std.heap.c_allocator that wraps libc's malloc. The switch cut the runtime down to 2.8 seconds with a memory usage of 228 MB. Way better, but still very slow.
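As a sketch (not zig-json's actual code), the swap is basically a one-line change, with the caveat that the C allocator needs libc linked in (e.g. zig build-exe main.zig -lc):

const std = @import("std");

pub fn main() !void {
    // Before: Zig's GeneralPurposeAllocator, which does safety checking.
    // var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    // const allocator = gpa.allocator();

    // After: libc's malloc/free through the C allocator.
    const allocator = std.heap.c_allocator;

    const buf = try allocator.alloc(u8, 1024);
    defer allocator.free(buf);
    std.debug.print("allocated {d} bytes\n", .{buf.len});
}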
What could I do to cut down on the number of memory allocations? There were simply a lot of fields in the JSON document, and since they weren’t known at compile-time, they had to be allocated on the heap.
This is where Zig really shines as opposed to a higher-level language. In Zig, just in the standard library, there are 10 different allocators (as of version 0.10.1). For this use case (a lot of allocations that all belong to one document), an arena allocator (std.heap.ArenaAllocator) was perfect. An arena allocator works by keeping a growable list of buffers where allocations are placed; items can't be individually freed, only the whole arena at once. This is perfect for a JSON document because it keeps everything together, drastically reducing the total number of allocations to system memory, but without wasting memory, because no memory needs to be freed during parsing (there aren't any intermediate representations that need to be stored on the heap and then freed). When the user is done with the document, they can free it all at once. For zig-json, the number of allocations decreased so much that I could switch back to using the GeneralPurposeAllocator behind an ArenaAllocator without any performance cost. After switching to ArenaAllocator, performance improved to about 0.15 seconds, an 86x speedup compared to my first implementation, just by changing how it allocated the memory. Memory usage, unfortunately, was still high, around 250 MB for a 17 MB file.
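Here is a minimal sketch of that setup. The call into the parser is only indicated in a comment, since the real entry point and its signature live in the zig-json repository:

const std = @import("std");

pub fn main() !void {
    var gpa = std.heap.GeneralPurposeAllocator(.{}){};
    defer _ = gpa.deinit();

    // All of the parser's allocations land in the arena, which takes big chunks
    // from the general purpose allocator instead of one system allocation per value.
    var arena = std.heap.ArenaAllocator.init(gpa.allocator());
    // The whole document is freed in one step when we're done with it.
    defer arena.deinit();

    const allocator = arena.allocator();

    // const document = try parse(json_text, allocator); // zig-json's entry point (see the repository)
    const copy = try allocator.dupe(u8, "freed by arena.deinit(), no individual free needed");
    _ = copy;
}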
When I profiled the code again, I saw that most of the compute time was spent parsing numbers (this particular file that I was benchmarking against had an unusually large number of decimals). After optimizing my number parsing as much as I could and comparing it to the standard library's std.fmt.parseFloat, I decided to just switch completely to the standard library's float-parsing function. My implementation was slightly faster most of the time, but when dealing with very large or very small numbers it lost accuracy too soon. There are a lot of corner cases in number parsing, and the standard library's implementation was more thorough than mine was going to be. So, this is what my number-parsing function ended up looking like:
fn parseUnsignedNumber(json: *SliceIterator(u8)) ParseError!f64 {
    var num_buf: [320]u8 = undefined;
    var num_length: usize = 0;
    while (json.peekCopy()) |char| {
        switch (char) {
            '0'...'9', '.', 'e', 'E', '-', '+' => {
                if (num_length >= num_buf.len) return ParseError.InvalidNumberLiteral;
                num_buf[num_length] = json.next() orelse unreachable;
                num_length += 1;
            },
            else => break,
        }
    }
    const number = std.fmt.parseFloat(f64, num_buf[0..num_length]) catch return ParseError.InvalidNumberLiteral;
    return number;
}
Another place I could improve performance was in the string-parsing function. While the document is being parsed, the buffer with the JSON string in it is in heap memory, and I was working through it byte-by-byte. However, this wasn't very efficient, because most string bytes would just get copied to the output buffer without modification. To speed this up, I made the string parser work on 8-byte chunks (8 bytes is the native pointer size on most modern CPUs, so they can handle this size very quickly). Before working through a chunk byte-by-byte, I checked the whole chunk: if every byte in the chunk was neither \ (the escape character) nor " (the string delimiter), it could be copied en masse into the output buffer. Otherwise, the parser would work through the chunk byte-by-byte, escaping characters or ending the loop as necessary.
while (json.peekMany(8)) |chars| {
    // ASCII optimization
    if (chars[0] != '\\' and chars[0] != '"' and
        chars[1] != '\\' and chars[1] != '"' and
        chars[2] != '\\' and chars[2] != '"' and
        chars[3] != '\\' and chars[3] != '"' and
        chars[4] != '\\' and chars[4] != '"' and
        chars[5] != '\\' and chars[5] != '"' and
        chars[6] != '\\' and chars[6] != '"' and
        chars[7] != '\\' and chars[7] != '"')
    {
        try string.appendSlice(&chars);
        json.ignoreMany(8);
        continue;
    }
    // ...
}
For a 44 MB ASCII-heavy JSON file, this resulted in a speedup from 0.22 seconds with ~145 MB of memory to 0.09 seconds and ~105 MB: more than twice as fast, and with 40 MB less memory!
Memory usage was still quite high though. After inspecting my code, I realized this was partly because of my JsonValue union. Originally, I had written it like this:
pub const JsonValue = union(enum) {
    Object: JsonObject,
    Array: ArrayList(JsonValue),
    String: []u8,
    Number: f64,
    Boolean: bool,
    Null,
};
But that meant that every instance of the union had to be big enough to fit the biggest possible field. So every single JsonValue, including numbers, booleans and even null, took 48 bytes. By switching JsonValue to use pointers for anything bigger than 8 bytes, I reduced that to 16 bytes, bringing memory usage for parsing the 17 MB file down to 94 MB from around 250 MB.
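To illustrate the effect (with simplified stand-in field types, not zig-json's real ones), the difference shows up directly in @sizeOf:

const std = @import("std");

// Illustrative only: a union with a large inline payload versus one that
// stores the same payload behind a pointer.
const InlineValue = union(enum) {
    Array: std.ArrayList(u8), // large payload stored inline
    Number: f64,
    Null,
};

const BoxedValue = union(enum) {
    Array: *std.ArrayList(u8), // payload behind a pointer
    Number: f64,
    Null,
};

test "payload size" {
    // On a 64-bit target this prints roughly 48 vs 16 bytes.
    std.debug.print("inline: {d}, boxed: {d}\n", .{ @sizeOf(InlineValue), @sizeOf(BoxedValue) });
}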
By the end, my optimized parser was about twice as fast as the standard library parser for the limited use case of parsing a JSON file to a generic data structure (i.e., not a particular Zig structure), and it also reliably used less memory. Working on this parser helped me to learn about high-performance computing and the Zig programming language, which I plan to continue to use any time memory allocation is an issue. You can check out the full source code on Github, and any suggestions for more optimizations would be welcome.