Published 10/23/2023
How to create a custom memory allocator in Rust
By Asher White
Rust is a hugely popular low-level language praised for its approach to speed and memory safety. However, both of those factors depend heavily on the memory allocator used. Normally, Rust uses the system allocator, which doesn’t always have the best performance. But, since 2018, programmers have been able to set a custom global allocator. This blog post will explore how to do that and how to write a completely custom allocator for your program’s needs.
Setting a custom allocator
Setting a custom allocator in Rust is surprisingly easy: you just need a static that implements the GlobalAlloc trait, annotated with the #[global_allocator] attribute. Like this:
use rsbmalloc::RSBMalloc;
#[global_allocator]
static ALLOCATOR: RSBMalloc = RSBMalloc::new();
What are some good global allocators? You can look at the #allocator tag on crates.io. My personal favourite is snmalloc-rs, which provides bindings to the high-performance snmalloc allocator from Microsoft. There are also popular Rust bindings to jemalloc and mimalloc. Or, if you’re looking for a decently fast allocator written in pure Rust, there’s the allocator that will be explored in this blog post: rsbmalloc. Selecting any of these allocators can bring large performance benefits for allocation-intensive programs, especially snmalloc for multi-threaded workloads.
How a custom allocator in Rust works
What if a pre-existing allocator doesn’t fit your needs? Because Rust gives you low-level control over memory, you can absolutely write your own allocator. So, what does an allocator actually have to do?
Computer memory comes as a huge array of bits organized into groups of eight (bytes). Each byte is identified by a number, which Rust represents as a usize: the pointer. By itself, that’s useless for a program unless the only memory it needs is one massive byte vector. So, that’s what an allocator is for: it splits up an array or arrays of bytes into sections that can be used for any kind of variable, regardless of size or type.
If you’re writing a user-space program, you don’t have access to the entire physical memory; otherwise your program could read and write to other processes’ memory. Instead, your program has a virtual address space. That means the pointers in your program don’t refer directly to physical memory; they have to be looked up in a table that maps them to the actual physical location. If you try to read an address that isn’t in this table (called the page table) for your program, you’ll get a segfault or a similar error.
How do you get into the page table? A page is a chunk of memory, often 4KiB or 16KiB, that the OS gives out to a program. You create a page with the mmap function from libc, which maps a virtual address range to a particular page. (Technically, the physical page only gets allocated on the first use of that address range, not at the mmap call.) You can also use sbrk to try to adjust the size of the memory your program already has, but this is less flexible, so this blog post will focus on using mmap.
How about getting rid of memory? You can call the munmap libc function, which does what it says—it unmaps those virtual addresses from the pages, letting other programs use the pages.
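To make that concrete, here is a minimal sketch of requesting a single page directly through the libc crate and then handing it back (error handling kept to the bare minimum):

use core::ptr;

fn main() {
    unsafe {
        // Ask the OS for one 4 KiB anonymous, private, read-write mapping.
        let addr = libc::mmap(
            ptr::null_mut(),
            4096,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        );
        // mmap signals failure with MAP_FAILED rather than a null pointer.
        assert_ne!(addr, libc::MAP_FAILED);

        // The mapped memory can now be used like any other pointer.
        *(addr as *mut u8) = 42;

        // Hand the page back to the OS.
        libc::munmap(addr, 4096);
    }
}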
Believe it or not, that’s all you need to make your first Rust allocator. To use an allocator in Rust, it needs to implement the GlobalAlloc trait, which has two required functions: alloc and dealloc. There are also realloc and alloc_zeroed, which have default implementations that you can override if it fits your use case.
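To see the full shape of the trait before diving in, here is a minimal pass-through implementation that simply defers to the system allocator (the PassThrough name is just for illustration):

use std::alloc::{GlobalAlloc, Layout, System};

struct PassThrough;

unsafe impl GlobalAlloc for PassThrough {
    // Required: hand out a block that satisfies `layout`, or null on failure.
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        System.alloc(layout)
    }

    // Required: take back a block previously returned for the same layout.
    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        System.dealloc(ptr, layout)
    }

    // `realloc` and `alloc_zeroed` keep their default implementations here.
}

#[global_allocator]
static GLOBAL: PassThrough = PassThrough;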
Writing a page allocator
The simplest allocator allocates entire pages—on its own this is inefficient, but it’s a great starting place for other allocators. The first step is finding your page size, like so:
(There will be a few snippets in the blog, but the full code of my custom Rust allocator is on GitHub)
use lazy_static::lazy_static;

lazy_static! {
    pub static ref PAGE_SIZE: usize = page_size();
}

#[cfg(unix)]
fn page_size() -> usize {
    #[cfg(target_os = "macos")]
    unsafe {
        libc::vm_page_size
    }
    #[cfg(not(target_os = "macos"))]
    unsafe {
        libc::sysconf(libc::_SC_PAGESIZE) as usize
    }
}

#[cfg(windows)]
fn page_size() -> usize {
    unsafe {
        let mut info = core::mem::zeroed();
        libc::GetSystemInfo(&mut info);
        return info.dwPageSize as usize;
    }
}
Note: I tried to include Windows equivalents when I could, but I don’t have a Windows machine so it isn’t tested.
So, on macOS, the page size is just a constant, vm_page_size; on Linux or Windows you have to call a function, which is why I cached the result with lazy_static.
After that, you can write your alloc function in the GlobalAlloc impl block:
unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
    let aligned_layout = match layout.align_to(max(layout.align(), *PAGE_SIZE)) {
        Ok(l) => l.pad_to_align(),
        Err(_) => return ptr::null_mut(),
    };
    #[cfg(windows)]
    {
        let addr = libc::VirtualAlloc(
            ptr::null_mut(),
            aligned_layout.size(),
            libc::MEM_COMMIT | libc::MEM_RESERVE,
            libc::PAGE_READWRITE,
        );
        addr as _
    }
    #[cfg(unix)]
    {
        let addr = libc::mmap(
            ptr::null_mut(),
            aligned_layout.size(),
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_PRIVATE | libc::MAP_ANONYMOUS,
            -1,
            0,
        );
        addr as _
    }
}
First, you need to round the size of the allocation up to the nearest multiple of PAGE_SIZE, then you can just go ahead and call libc::mmap. There are a few flags that need to be added: PROT_READ and PROT_WRITE because you want to be able to read and write to this memory, and MAP_PRIVATE and MAP_ANONYMOUS because the memory is private to this process and doesn’t represent a file stored in memory. The -1 and 0 as the next arguments mean the same thing: it’s not a file in memory. There are a few more things you can do with mmap; you can read man 2 mmap for more information.
dealloc (unmapping memory) is even easier:
unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
    if let Ok(aligned) = layout.align_to(max(layout.align(), *PAGE_SIZE)) {
        #[cfg(windows)]
        libc::VirtualFree(ptr as _, aligned.pad_to_align().size(), libc::MEM_RELEASE);
        #[cfg(not(windows))]
        libc::munmap(ptr as _, aligned.pad_to_align().size());
    }
}
That’s it for a basic custom allocator. The only problem—it’s hugely inefficient and slow because it maps entire pages for every allocation. Even so, you already have a complete, fully functioning custom allocator for Rust; one thing you could still add is some custom realloc logic.
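The simplest realloc for this page allocator allocates a new region, copies the old contents across, and unmaps the old one, which is essentially what the standard library’s default implementation already does. A rough sketch (a smarter version might use mremap on Linux to avoid the copy):

unsafe fn realloc(&self, ptr: *mut u8, layout: Layout, new_size: usize) -> *mut u8 {
    // Keep the original alignment, change only the size.
    let new_layout = match Layout::from_size_align(new_size, layout.align()) {
        Ok(l) => l,
        Err(_) => return ptr::null_mut(),
    };
    let new_ptr = self.alloc(new_layout);
    if !new_ptr.is_null() {
        // Copy whichever is smaller: the old allocation or the new one.
        ptr::copy_nonoverlapping(ptr, new_ptr, core::cmp::min(layout.size(), new_size));
        self.dealloc(ptr, layout);
    }
    new_ptr
}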
Writing a binned allocator
How do you make it faster? Every allocator answers that question a different way. For an in-depth look at a state-of-the-art allocator, you can read the white paper for snmalloc. The allocator I’ll demonstrate in this blog will be far simpler but still have decent performance in many situations.
The approach I took is called a binned allocator, where the allocator takes entire pages and devotes them to a certain size of allocation. For example, on one 16KiB page, you can accept 64 256-byte allocations—but you only have to ask the OS for memory once. There are many ways to make an allocator faster, but a rule of thumb for starting out is to minimize the number of mmap calls you make: anything involving requesting new memory from the system can be slow.
In a binned allocator, a bin is a region of memory that contains a number of slots that are all the same size. A slot, from the allocator’s perspective, is a union type that can be one of three things: a buffer containing program data if the slot is in use; a pointer to the next free slot, if it’s free; or a null pointer, if it’s the last free slot.
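As a rough illustration, a 256-byte slot could look like this (the exact representation is an assumption; rsbmalloc’s real definition may differ):

use core::ptr::NonNull;

// One 256-byte slot: either live program data or a link in the free list.
#[repr(C)]
union Slot256 {
    buf: [u8; 256],                  // in use: the raw bytes handed to the program
    next: Option<NonNull<Slot256>>,  // free: the next free slot, or None at the end of the list
}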
This is what the layout of a bin looks like:
struct Bin<S: Slot> {
    free_head: Mutex<FreeList<S>>,
    page: Mutex<Slice>,
}
For the purposes of the allocator, the Slice type is just a struct containing a pointer and a length, and it represents any uninitialized space left on the page. The underlying fields need to be wrapped in a Mutex because there is only one allocator shared by all the threads in the program—later on I’ll show how to do this more efficiently. The FreeList type is just a pointer to the next free Slot, or a null pointer if there are no free slots.
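Concretely, the two helper types look something like this (simplified sketches, consistent with how they are used in the code below):

use core::ptr::NonNull;

// The unused tail of the most recently mapped page.
struct Slice {
    ptr: *mut u8, // start of the unused region
    len: usize,   // bytes remaining in it
}

// Head of the free list: the most recently freed slot, if any.
struct FreeList<S>(Option<NonNull<S>>);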
Since every Slot type has its own size and alignment, I used a macro to create all the sizes I needed. So, this is what the structure of the entire allocator looks like:
#[derive(Default)]
pub struct RSBMalloc {
    #[cfg(not(feature = "std"))]
    bins: Bins,
}

#[derive(Default)]
pub(crate) struct Bins {
    pub(crate) bin4: Bin<Slot4>,
    pub(crate) bin8: Bin<Slot8>,
    pub(crate) bin16: Bin<Slot16>,
    pub(crate) bin32: Bin<Slot32>,
    pub(crate) bin64: Bin<Slot64>,
    pub(crate) bin128: Bin<Slot128>,
    pub(crate) bin256: Bin<Slot256>,
    pub(crate) bin512: Bin<Slot512>,
    pub(crate) bin1024: Bin<Slot1024>,
    pub(crate) bin2048: Bin<Slot2048>,
    pub(crate) bin4096: Bin<Slot4096>,
    pub(crate) bin8192: Bin<Slot8192>,
    pub(crate) bin16384: Bin<Slot16384>,
    pub(crate) bin32ki: Bin<Slot32Ki>,
    pub(crate) bin64ki: Bin<Slot64Ki>,
}
For rsbmalloc, I used 15 bins that are each twice the size of the previous one—this worked for me, but you could experiment with the size and number of bins for your use case.
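The macro that generates the slot types isn’t shown in this post, but a sketch of the general idea could look like this (illustrative only; the real macro in rsbmalloc may differ):

use core::ptr::NonNull;

// Stamp out one union type per slot size, following the Slot256 shape above.
macro_rules! slot_type {
    ($name:ident, $size:expr) => {
        #[repr(C)]
        pub(crate) union $name {
            buf: [u8; $size],
            next: Option<NonNull<$name>>,
        }
    };
}

// Note: a slot has to be at least pointer-sized so it can hold the free-list link.
slot_type!(Slot16, 16);
slot_type!(Slot32, 32);
slot_type!(Slot64Ki, 0x10000);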
So, once rsbmalloc receives an alloc call, it takes the requested size, matches it to the corresponding bin, and passes on the allocation request:
unsafe impl GlobalAlloc for RSBMalloc {
    unsafe fn alloc(&self, layout: core::alloc::Layout) -> *mut u8 {
        let size = layout.pad_to_align().size();
        match size {
            ..=4 => self.bins.bin4.alloc(),
            ..=8 => self.bins.bin8.alloc(),
            ..=16 => self.bins.bin16.alloc(),
            ..=32 => self.bins.bin32.alloc(),
            ..=64 => self.bins.bin64.alloc(),
            ..=128 => self.bins.bin128.alloc(),
            ..=256 => self.bins.bin256.alloc(),
            ..=512 => self.bins.bin512.alloc(),
            ..=1024 => self.bins.bin1024.alloc(),
            ..=2048 => self.bins.bin2048.alloc(),
            ..=4096 => self.bins.bin4096.alloc(),
            ..=8192 => self.bins.bin8192.alloc(),
            ..=16384 => self.bins.bin16384.alloc(),
            ..=0x8000 => self.bins.bin32ki.alloc(),
            ..=0x10000 => self.bins.bin64ki.alloc(),
            _ => PAGE_ALLOCATOR.alloc(layout),
        }
    }
}
Whichever bin gets the alloc call then tries three things:
- It checks its free list. If there is a free slot, it immediately returns it.
- If not, it checks its page field to see if there is any unused space left over from the most recent page allocation.
- If not, it requests another page from the OS.
impl<S: Slot> Bin<S> {
    fn add_one(&self) -> *mut S {
        let slot_size = mem::size_of::<S>();
        let mut page = self.page.lock();
        if !page.ptr.is_null() {
            if page.len >= slot_size {
                let ret = page.ptr as *mut S;
                unsafe {
                    page.ptr = page.ptr.add(slot_size);
                    page.len -= slot_size;
                }
                return ret;
            }
        }
        unsafe {
            let ptr = PAGE_ALLOCATOR.alloc(Layout::from_size_align_unchecked(
                RSB_CHUNK_SIZE,
                mem::align_of::<S>(),
            ));
            let ret = ptr as *mut S;
            page.ptr = ptr.add(slot_size);
            page.len = RSB_CHUNK_SIZE - slot_size;
            ret
        }
    }

    /// Allocates a pointer with size SIZE
    unsafe fn alloc(&self) -> *mut u8 {
        let mut free_head = self.free_head.lock();
        if free_head.exists() {
            let buf = free_head.get_buf();
            (*free_head) = free_head.get_next().into();
            buf
        } else {
            drop(free_head);
            (*self.add_one()).buf()
        }
    }
}
The deallocation process is similar: the size of the deallocation is mapped to a bin, and the bin casts the pointer it’s given to a Slot pointer. It then sets that Slot’s next pointer to the current free_head, and makes free_head point to the newly freed slot.
unsafe fn dealloc(&self, ptr: *mut u8) {
    let slot_ptr = ptr as *mut S;
    let mut free_head = self.free_head.lock();
    (*slot_ptr).set_next((*free_head).option_nn());
    (*free_head) = FreeList::from(slot_ptr);
}
Reallocation is just a combination of allocation and deallocation along with a copy and a check for pre-existing space.
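A rough sketch of how that could look for the binned allocator, under the assumption that a reallocation staying within the same size class can simply return the original pointer (bin_index here is a hypothetical helper that maps a size to its size class):

unsafe fn realloc(&self, ptr: *mut u8, layout: Layout, new_size: usize) -> *mut u8 {
    let old_size = layout.pad_to_align().size();
    // Same size class: the existing slot is already big enough.
    if bin_index(old_size) == bin_index(new_size) {
        return ptr;
    }
    // Different size class: allocate from the new bin, copy, then free the old slot.
    let new_layout = match Layout::from_size_align(new_size, layout.align()) {
        Ok(l) => l,
        Err(_) => return ptr::null_mut(),
    };
    let new_ptr = self.alloc(new_layout);
    if !new_ptr.is_null() {
        ptr::copy_nonoverlapping(ptr, new_ptr, core::cmp::min(old_size, new_size));
        self.dealloc(ptr, layout);
    }
    new_ptr
}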
There are a few downsides to this kind of binned allocator. For one thing, pages for allocations under 64KiB are never released back to the OS; they stay mapped for the duration of the program in the name of performance. This version is also written from an essentially single-threaded perspective, so all the mutexes will slow it down quickly if there’s more than one thread. For another thing, if an allocation or reallocation doesn’t fit well inside a slot, there can be a lot of wasted space: if the program allocates a lot of 129-byte objects, each one takes up an entire 256-byte slot. Finally, allocations larger than 64KiB are passed straight to the OS, because you’d want those allocations to be freed back to the system after use. But if the program repeatedly allocates a large block, frees it, and then allocates another, this design could slow it down significantly.
But—the whole point of writing your own allocator is that you can adapt it to your program. If keeping memory usage low matters more than raw speed and you make and free a lot of small allocations, you can release pages back to the system. Or, if your program makes a lot of 129-byte allocations, you can add a 129-byte bin. Tailoring the allocator to your program like this is what can allow you to improve performance.
Another caveat is that this allocator makes no effort to improve security by randomizing allocations or anything like that. If that’s a concern for your program, it would be better to use an established allocator with hardened security, like snmalloc, instead of trying to write your own.
However, something you can add to this basic allocator is a better multi-threading system.
Writing a multi-threaded allocator with thread caches
In theory, the allocator I just wrote is multi-threaded: it won’t fail, and memory safety will be preserved. But, it will slow down, and depending on the number of threads, become completely unusable.
Why? There’s only one static set of bins, and since every allocation under 64KiB needs to modify the bins in some way, everything has to be wrapped in mutexes. But the stdlib Mutex type makes heap allocations, so it can’t be used inside the allocator that provides those heap allocations. That’s why I used the Mutex type from the spin crate. The downside is that any thread waiting for the Mutex just spins, taking up 100% of a CPU core. So, if you have a lot of concurrent allocations, the threads will gridlock and slow each other down.
What can be done? There are a few very interesting approaches, but rsbmalloc takes one of the simpler ones: thread caches. This approach means every thread (generally) has its own set of bins, so there’s no lock contention.
However, there’s no way to get the allocator to run its own code whenever a thread is spawned. Instead, it uses each thread’s unique ID to map that thread to a set of bins. In theory, a set of bins could end up shared by more than one thread if there are enough threads, which is why everything is still wrapped in mutexes. In practice, though, there are enough sets of bins that there is rarely contention for the locks, so there’s little slowdown.
The downsides of this approach are that free slots aren’t shared across threads, so if items are freed on a different thread than the one they were allocated on, you can end up with a lot of free memory that never gets reused, and it’s not the absolute fastest approach either.
But, it works as a proof-of-concept.
The allocator initializes by pre-creating a bunch of thread caches (8 times the number of logical cores) and putting them on the heap. Since this only needs to happen once, it’s okay to use the underlying page allocator directly. Then, whenever alloc is called, the thread ID is hashed and taken modulo the number of thread caches, mapping that thread to a thread cache.
pub(crate) struct ThreadCache {
    pub bins: OnceCell<BinsSlice>,
}

pub(crate) struct BinsSlice {
    pub ptr: *mut Bins,
    pub len: usize,
}

unsafe impl Sync for BinsSlice {}
unsafe impl Send for BinsSlice {}

impl ThreadCache {
    unsafe fn get_thread_cache<'a>(&'a self, id: usize) -> &'a mut Bins {
        let bins_slice = self.bins.get_or_init(init_bins);
        let hashed = hash_usize(id);
        let offset = (hashed % bins_slice.len) as isize;
        &mut *bins_slice.ptr.offset(offset)
    }
}

fn hash_usize(input: usize) -> usize {
    let mut output = input as u64;
    output ^= output >> 33;
    output = output.wrapping_mul(0xff51afd7ed558ccd);
    output ^= output >> 33;
    output = output.wrapping_mul(0xc4ceb9fe1a85ec53);
    output ^= output >> 33;
    output as usize
}

#[cfg(unix)]
pub(crate) fn thread_id() -> usize {
    unsafe { libc::pthread_self() as usize }
}

#[cfg(windows)]
pub(crate) fn thread_id() -> usize {
    unsafe { libc::GetCurrentThreadId() as usize }
}
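The init_bins function used by get_or_init isn’t shown above. A sketch of what it might do on Unix, assuming the core count comes from sysconf (the real rsbmalloc code may differ):

fn init_bins() -> BinsSlice {
    unsafe {
        // Assumption: 8 caches per logical core.
        let cores = libc::sysconf(libc::_SC_NPROCESSORS_ONLN).max(1) as usize;
        let len = cores * 8;
        // Grab backing memory straight from the page allocator (this only runs once),
        // then default-initialize each set of bins in place.
        let layout = Layout::array::<Bins>(len).unwrap();
        let ptr = PAGE_ALLOCATOR.alloc(layout) as *mut Bins;
        for i in 0..len {
            ptr.add(i).write(Bins::default());
        }
        BinsSlice { ptr, len }
    }
}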
After that, the alloc, dealloc, and realloc functions have one small change made to them:
let bins = self.thread_cache.get_thread_cache(thread_id());
match size {
    ..=4 => bins.bin4.alloc(),
    // ...etc.
}
Making these changes will drastically improve your multi-threaded performance compared to the previous version.
Debugging an allocator in Rust
A note about debugging: I had never used lldb or gdb before working on this project, but they are absolutely indispensable for this kind of program. Normally, I’d just use println!() and panic!() for inspecting variables and debugging. But here’s the thing: both of those macros call functions that allocate on the heap (they both take format strings). So, if your allocator panics or doesn’t work, those macros are useless. I personally used lldb extensively to inspect code paths, look at variables along the way and see the cause of panics.
Developing an allocator, you’ll see a lot of SIGABRT, SIGSEGV, SIGKILL and SIGILL signals. These don’t come with the nice backtrace and explanation that Rust normally provides, so you need a debugger to look at the stack trace, which can help you find the problem. For example, after adding the 64KiB bins, I was seeing a SIGABRT because there was a panic inside a panic. Running cargo test on its own told me nothing more, but running the test executable through lldb showed me that the panic was about an unaligned pointer dereference, which let me fix the problem.
This blog post has provided a basic introduction to custom memory allocators in Rust. Writing one can be a valuable way to hone performance for your particular use case, or you can simply switch to an established fast allocator like snmalloc to improve allocation speed in any program. Either way, you get the performance improvement without giving up Rust’s memory safety guarantees.