Alex Saveau

The need for new instructions: atomic bit fill and drain

2025-09-24T00:00:00+00:00

In the previous article, I proposed a new lockless channel architecture based on two bit vectors. One is used to reserve access to a memory region, the other to commit changes. I showed promising theoretical performance improvements: whereas existing architectures pay the cost of linearizability without clients being able to take advantage of such consistency, lockless bags allow each element in the channel to be operated upon fully independently. In theory.

In practice, hardware processes atomic operations by locking cache lines. Readers and writers in lockless bags need to modify both bit vectors, which means locking both cache lines and neutering any performance benefits offered by lockless bags. How can we fix this?

Hardware needs to give software an efficient mechanism for locking memory regions.

New instructions: `atomic_bit_{fill,drain}` and `atomic_bit_set`

Here are the proposed signatures:

atomic_bit_{fill,drain}: {cache line address, max # bits} => {changed bit mask}
atomic_bit_set: {cache line address, bit mask} => {}

Given a cache line and a maximum number of bits M to fill, atomic_bit_fill scans the cache line for zeros and flips at most M of them to ones. It returns the mask of flipped bits. atomic_bit_drain operates identically except that it scans for ones and returns them to zero. These instructions could operate on machine words, but the more bits you have the deeper your channel queue can be. I assume it wouldn’t be too difficult to return 512-bit change masks and use AVX-512 instructions to compute indices out of the mask.
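To make the semantics concrete, here is a minimal single-threaded model of the two operations. It is only a sketch under stated assumptions: it works on one 64-bit word rather than a 512-bit cache line, it always picks the lowest eligible bits (the proposal leaves that choice to hardware), and the names `bit_fill`, `bit_drain`, and `mask_indices` are hypothetical.

```rust
/// Models `atomic_bit_fill` on one 64-bit word: flips at most `max_bits`
/// zeros to ones and returns the mask of flipped bits.
/// Assumption: the lowest zero bits are picked first.
fn bit_fill(word: &mut u64, max_bits: u32) -> u64 {
    let mut changed = 0u64;
    for _ in 0..max_bits {
        let free = !*word; // ones mark bits that are still zero
        if free == 0 {
            break; // the word is full
        }
        let lowest = free & free.wrapping_neg(); // isolate the lowest free bit
        *word |= lowest;
        changed |= lowest;
    }
    changed
}

/// Models `atomic_bit_drain`: the mirror image, flipping ones to zeros.
fn bit_drain(word: &mut u64, max_bits: u32) -> u64 {
    let mut changed = 0u64;
    for _ in 0..max_bits {
        let set = *word;
        if set == 0 {
            break; // the word is empty
        }
        let lowest = set & set.wrapping_neg(); // isolate the lowest set bit
        *word &= !lowest;
        changed |= lowest;
    }
    changed
}

/// Expands a change mask into the slot indices it covers.
fn mask_indices(mut mask: u64) -> Vec<u32> {
    let mut indices = Vec::new();
    while mask != 0 {
        indices.push(mask.trailing_zeros());
        mask &= mask - 1; // clear the lowest set bit
    }
    indices
}
```

Starting from the word `0b1011`, filling two bits in this model returns the mask `0b10100` (slots 2 and 4) and leaves the word at `0b11111`.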

How do you use these instructions?

The concepts from lockless bags apply mechanically:

To prepare M elements, call atomic_bit_fill on the reservation bit vector.
Use the returned mask to write your elements into their corresponding array slots. An empty mask means the channel is full.
Once the elements have been written, call atomic_bit_set to update the commit bit vector.

Receive elements by using atomic_bit_drain instead.
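The steps above can be sketched in software today, with the caveat that the proposed instructions have to be emulated. In the sketch below, `flip_up_to` is a hypothetical stand-in that emulates both fill and drain with a compare-and-swap loop on a 64-bit word, which is exactly the cache line ownership cost the real instructions are meant to avoid. `Bag`, `send`, `recv`, and `demo` are illustrative names, and slots hold plain `u64` values so the example stays in safe Rust.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

// Stand-in for the proposed instructions (an assumption, not real hardware):
// fill when `from_ones` is false, drain when it is true.
fn flip_up_to(bits: &AtomicU64, max_bits: u32, from_ones: bool) -> u64 {
    let mut changed = 0u64;
    let _ = bits.fetch_update(Ordering::AcqRel, Ordering::Acquire, |cur| {
        changed = 0; // the closure may be retried, so start over each time
        let mut candidates = if from_ones { cur } else { !cur };
        for _ in 0..max_bits {
            if candidates == 0 {
                break;
            }
            let lowest = candidates & candidates.wrapping_neg();
            candidates &= !lowest;
            changed |= lowest;
        }
        (changed != 0).then_some(cur ^ changed)
    });
    changed
}

struct Bag {
    reserved: AtomicU64,  // bit i set: slot i is owned by a producer or consumer
    committed: AtomicU64, // bit i set: slot i holds a published value
    slots: [AtomicU64; 64],
}

impl Bag {
    fn new() -> Self {
        Bag {
            reserved: AtomicU64::new(0),
            committed: AtomicU64::new(0),
            slots: std::array::from_fn(|_| AtomicU64::new(0)),
        }
    }

    /// Sends a prefix of `values`; returns how many were accepted (0 means full).
    fn send(&self, values: &[u64]) -> usize {
        let want = values.len().min(64) as u32;
        let granted = flip_up_to(&self.reserved, want, false); // step 1: fill
        let (mut mask, mut sent) = (granted, 0);
        while mask != 0 {
            let slot = mask.trailing_zeros() as usize;
            self.slots[slot].store(values[sent], Ordering::Relaxed); // step 2: write
            sent += 1;
            mask &= mask - 1;
        }
        self.committed.fetch_or(granted, Ordering::Release); // step 3: set
        sent
    }

    /// Receives up to `max` values in unspecified order; empty means none were ready.
    fn recv(&self, max: u32) -> Vec<u64> {
        let granted = flip_up_to(&self.committed, max, true); // drain
        let (mut mask, mut out) = (granted, Vec::new());
        while mask != 0 {
            let slot = mask.trailing_zeros() as usize;
            out.push(self.slots[slot].load(Ordering::Relaxed));
            mask &= mask - 1;
        }
        self.reserved.fetch_and(!granted, Ordering::Release); // hand the slots back
        out
    }
}

/// Four producers send 100 values each, two consumers sum whatever they receive.
fn demo() -> u64 {
    let bag = Bag::new();
    let (received, sum) = (AtomicU64::new(0), AtomicU64::new(0));
    let (bag, received, sum) = (&bag, &received, &sum);
    thread::scope(|s| {
        for p in 0..4u64 {
            s.spawn(move || {
                for v in 1..=100 {
                    while bag.send(&[p * 1000 + v]) == 0 {
                        thread::yield_now(); // bag is full: back off and retry
                    }
                }
            });
        }
        for _ in 0..2 {
            s.spawn(move || {
                while received.load(Ordering::Relaxed) < 400 {
                    let got = bag.recv(8);
                    sum.fetch_add(got.iter().sum::<u64>(), Ordering::Relaxed);
                    received.fetch_add(got.len() as u64, Ordering::Relaxed);
                    if got.is_empty() {
                        thread::yield_now();
                    }
                }
            });
        }
    });
    sum.load(Ordering::Relaxed)
}
```

The protocol itself only ever touches the bits it was granted. The emulation is the part that serializes, since every `flip_up_to` call has to own the cache line before it can decide which bits to flip.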

Why are these instructions more powerful than existing atomics?

atomic_bit_{fill,drain} offer novel acceleration opportunities: the key idea is that cores may issue requests without knowing the state of the bit vectors. This bears repeating: cmpxchg and atomic bitwise instructions require knowing the state of the word being operated upon prior to modification. On the other hand, atomic_bit_{fill,drain} instructions do not need to know the state of the word and therefore do not need to own the memory being operated upon, meaning cores can issue fill/drain requests in parallel.

Put another way, cores never need to pull the cache lines down. A synchronization point to process incoming requests remains, but in the time it takes the on-chip network to return the resulting bit mask to the originating core, other requests may have been issued and serviced, changing the state of our bit vector.

We’ve found a way to let cores communicate their ownership over memory regions concurrently! Global linearizability has been vanquished at last: operations over a particular element in the bit vectors are linearized, but each element’s operations are fully independent.

That’s not all! We have a poor man’s Intel Dynamic Load Balancer, but that doesn’t mean we can’t get a little fancy:

- The L1 data pre-fetcher could intelligently start loading the corresponding cache lines of the array slots we claimed when executing an atomic_bit_{fill,drain} instruction.
- The CPU could move the bit vector cache lines around dynamically based on load. For example the lines could be in L3 if most cores are using it, or moved into the L2 of a core near the main sources of request traffic.
  - Taking this idea further, the 512 bits could be dynamically partitioned by the CPU to local core complexes. Say your chip is organized such that groups of cores have faster means of communication amongst themselves than going to L3. Each group could get ownership over a slice of the 512 bits and fall back to asking other groups for bits only if the local partition is full/empty. Perhaps more interesting is partitioning across NUMA nodes.
- Cores could develop affinities/flows on a best effort basis. Say core A is frequently publishing data while another core B is frequently consuming data. The server handling fill/drain requests could learn this pattern and try to ensure B gets masks primarily containing A’s data to potentially improve locality.

The nitty-gritty

The instruction signatures I proposed above aren’t quite enough for a fully featured MPMC channel.

To implement sleeping on an empty or full channel, hardware support is required (unless the fill/drain instructions are implemented on 32-bit words). A sleeping bit will need to be reserved somewhere in the cache line. The atomic_bit_set instruction will need to support setting the sleeping bit atomically with the changed mask and return the previous state of the sleeping bit.

The fill/drain instructions also need to return a dead bit so channel disconnects can be handled properly.

Both of these bits can be placed at the end of the cache line and normal atomic bitwise operations need to be supported on the last 32 bits of the cache line so these bits can be updated as needed and used as a futex.

Additionally, the atomic_bit_set instruction doesn’t make sense as is: set the bits to what? So there could be two variants of the instruction, one to set the mask bits to zero and another to set them to one. Or the instruction could be expressed in terms of OR and AND operations, where the change mask is inverted when passed into the AND variant of the change mask update instruction.

Finally, atomic_bit_set should support acquire/release memory orderings.
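For a single machine word, the OR/AND formulation already exists in today's atomics, which gives a feel for what the cache-line-wide version might look like. This is only an illustration on `AtomicU64`; the function names are hypothetical, and returning the previous word is my assumption about how the previous sleeping bit could be reported.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// OR variant: every bit in `mask` becomes one. Returns the previous word.
fn atomic_bit_set_or(bits: &AtomicU64, mask: u64, order: Ordering) -> u64 {
    bits.fetch_or(mask, order)
}

/// AND variant: the caller passes the inverted change mask, so every bit
/// that is zero in `inverted_mask` becomes zero. Returns the previous word.
fn atomic_bit_set_and(bits: &AtomicU64, inverted_mask: u64, order: Ordering) -> u64 {
    bits.fetch_and(inverted_mask, order)
}
```

A producer would commit its slots with `atomic_bit_set_or(&committed, mask, Ordering::Release)`, while a consumer would return them with `atomic_bit_set_and(&reserved, !mask, Ordering::Release)`.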

MPSC channel support

To support MPSC channels, additional hardware support is required. I couldn’t figure out a good way to maintain independence between streams, so we return to global linearizability by implementing software lockless queues in hardware. The hardware can reserve some bits to represent head/tail pointers in each bit vector’s cache line, marking where the next bits should be filled/drained. The atomic_bit_drain instruction would only be allowed to return a non-empty mask when the head points to filled bits. Note that this approach incurs the same problems as described in the previous article around unfortunately timed context switches blocking the entire channel.

However, hardware acceleration maintains the advantage of supporting concurrent requests to add elements to the channel.

Recap

Existing lockless channels can only operate as fast as the hardware can move a cache line between cores, which can be terribly slow. This novel approach proposes a stateless hardware acceleration method for lockless channels: with it, cores can independently and concurrently request modifications to the lockless channel. In essence, the new approach deeply pipelines updates while existing channels serialize updates.

Please reach out if you’re building something like this; I’d be very interested in providing the software implementation.

Your MPSC/SPMC/MPMC queue is not a queue

2025-08-16T00:00:00+00:00

Lockless queues let multiple cores communicate with each other without mutexes, typically to move work around for parallel processing. They come in four variants: {single,multi}-producer {single,multi}-consumer. A producer gives data to a consumer, each of which can be limited to a single thread (i.e. a single-{producer,consumer}) or shared across multiple threads. But only the single-producer single-consumer (SPSC) queue is actually a queue!

This article is part of a series building Lockness, a high-performance blocking task executor.

SPMC/MPMC queues are broken

Consider the so-called SPMC queue. By definition, received messages cannot be processed in a total global order without additional external synchronization. First, the single, ordered input stream is arbitrarily split amongst each consumer. Then, each message is removed from the queue in the same order. But from this moment onwards, the consumer thread can be paused at any moment (even within the library implementation that is still copying the data into your code). Consequently, another thread can process an element from the future before your code has even had a chance to see that it claimed an element, now from the past.

Causality within consumers is the only upheld invariant: a consumer will not see any elements prior to the last element it has seen. This guarantee is almost certainly too weak to be useful, as consumers have no control over which elements they see, meaning arbitrary elements from the future may have been processed on other threads.

An example SPMC queue where element B is processed on thread 2 acausally before A on thread 1

Similar logic applies to MPMC channels with the additional weakening that different producer streams are processed in no particular order. To work around this, some implementations use many SPMC channels to make a MPMC channel. They introduce the concept of a token which lets consumers optionally choose a specific producer to consume. Were this token to guarantee exclusive access to the producer, you’ve just created a poor man’s SPSC queue. Without exclusivity, you get all the same problems as SPMC channels (items being processed out-of-order by other threads).

MPSC queues are special

While additional synchronization can be applied on top of SPMC and MPMC channels to provide ordering guarantees, the more useful abstraction is a stream. MPSC channels are special in that each producer can be thought of as its own stream, even though no ordering guarantees are provided between streams (producers). The consumer will see each stream in order with interleavings between streams. In hardware terms, it’s a multiplexer.

Consumer threads can then be set up for specific purposes. For example, Ringboard uses two consumer threads following the actor model in its UI implementations. Any thread can request state changes and/or submit view updates, but state changes and view updates are each processed serially on their own threads. Since I only have two consumer threads, this is effectively a mini model-view-controller framework: the controller thread handles model updates and the main thread updates the view. Notice that order within streams is important: the controller should process user input in the order in which actions occurred. However, other updates (i.e. other streams/producers) like an image loader having finished retrieving an image from a background thread can be interleaved arbitrarily with the user input stream.

Thus, MPSC channels as a whole aren’t queues, but each producer is its own queue which provides useful guarantees.
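The per-stream guarantee can be observed with the standard library's MPSC channel, which as far as I know preserves the order of messages sent by any one thread. The sketch below tags every message with its producer, lets the single consumer collect everything, then checks each producer's stream separately. The function names are illustrative.

```rust
use std::sync::mpsc;
use std::thread;

/// Spawns `producers` threads that each send 0..per_producer tagged with
/// their id, and returns everything the single consumer saw, in arrival order.
fn run(producers: usize, per_producer: usize) -> Vec<(usize, usize)> {
    let (tx, rx) = mpsc::channel();
    let handles: Vec<_> = (0..producers)
        .map(|id| {
            let tx = tx.clone();
            thread::spawn(move || {
                for seq in 0..per_producer {
                    tx.send((id, seq)).unwrap();
                }
            })
        })
        .collect();
    drop(tx); // the channel closes once every producer's clone is gone

    let seen: Vec<_> = rx.iter().collect();
    for handle in handles {
        handle.join().unwrap();
    }
    seen
}

/// True if every producer's messages appear in the order they were sent,
/// regardless of how the streams are interleaved with each other.
fn per_stream_order_holds(seen: &[(usize, usize)], producers: usize, per_producer: usize) -> bool {
    (0..producers).all(|id| {
        seen.iter()
            .filter(|(producer, _)| *producer == id)
            .map(|(_, seq)| *seq)
            .eq(0..per_producer)
    })
}
```

The interleaving between producers changes from run to run, but `per_stream_order_holds` should return true every time.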

The rundown

To summarize, SPMC queues and by extension MPMC queues don’t have useful ordering guarantees—calling them queues is silly. MPSC queues can be thought of as a set of producer queues multiplexed together.

Note: I’ve left SPSC queues out of this discussion because they are real queues with a generally agreed upon optimal implementation: power-of-2 queue capacity backed by duplicated mmaps with cached head/tail pointers expressed in terms of elements written/read, and optional get_robust_list support to handle multiprocess shared memory dead counterparty notification.
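As a rough illustration of two of those ingredients, here is a single-threaded model of a power-of-2 ring whose head and tail are counters of elements read and written. It deliberately leaves out the duplicated mmaps, the cached copies of the other side's counter, and all atomics; `Ring` and its methods are hypothetical names.

```rust
struct Ring<T> {
    slots: Vec<Option<T>>, // length is a power of two
    written: u64,          // total elements ever written (the tail)
    read: u64,             // total elements ever read (the head)
}

impl<T> Ring<T> {
    fn new(capacity: usize) -> Self {
        assert!(capacity.is_power_of_two());
        Ring {
            slots: (0..capacity).map(|_| None).collect(),
            written: 0,
            read: 0,
        }
    }

    /// The counters never wrap back to zero, so their difference is the length.
    fn len(&self) -> usize {
        (self.written - self.read) as usize
    }

    /// A power-of-two capacity turns the modulo into a mask.
    fn index(&self, counter: u64) -> usize {
        (counter as usize) & (self.slots.len() - 1)
    }

    fn push(&mut self, value: T) -> Result<(), T> {
        if self.len() == self.slots.len() {
            return Err(value); // full
        }
        let i = self.index(self.written);
        self.slots[i] = Some(value);
        self.written += 1;
        Ok(())
    }

    fn pop(&mut self) -> Option<T> {
        if self.len() == 0 {
            return None; // empty
        }
        let i = self.index(self.read);
        self.read += 1;
        self.slots[i].take()
    }
}
```

Counting elements instead of storing wrapped indices means full and empty are distinguishable without sacrificing a slot.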

Lockless queues are slow

Lockless queues are so named by virtue of being implemented with a queue, namely circular buffers or variations on linked lists. This is problematic because the queue linearizes updates to the channel where no such global ordering can be observed as explained above.

Blocked producers and consumers in a MPMC channel, due to linearization

The implementation of a lockless queue can be conceptualized through four pointers:

- The tail: producers increment it to reserve a slot to work within.
  - Slots in red are being written to without interfering with consumers.
- The committed pointer: producers have finished writing to all slots past this offset.
  - Slots in orange should be ready to be consumed, but the shaded slot hasn’t finished its write, thereby blocking consumers from accessing subsequent elements.
- The head: consumers increment it to claim a slot for consumption.
  - Slots in green are ready to be read without interfering with producers.
- The consumed pointer: consumers have finished reading all slots past this offset.
  - Slots in blue should be ready to be written to, but the shaded slot hasn’t finished its read, thereby blocking producers from writing to subsequent elements.

This approach is not wait-free: a context-switched producer or consumer in the middle of writing or reading a value will prevent further progress.
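The blocking behavior can be reproduced with a small single-threaded model. This is a simplification under stated assumptions: the committed pointer is replaced by a per-slot ready flag, the consumed pointer is folded into the head (claiming a slot is treated as reading it instantly), and `QueueModel` is a hypothetical name.

```rust
struct QueueModel {
    ready: Vec<bool>, // per-slot flag: the producer finished writing
    tail: usize,      // count of slots reserved by producers
    head: usize,      // count of slots claimed by consumers
}

impl QueueModel {
    fn new(capacity: usize) -> Self {
        assert!(capacity > 0);
        QueueModel { ready: vec![false; capacity], tail: 0, head: 0 }
    }

    /// A producer reserves the next slot in line, or `None` if the queue is full.
    fn reserve(&mut self) -> Option<usize> {
        if self.tail - self.head == self.ready.len() {
            return None;
        }
        let slot = self.tail % self.ready.len();
        self.tail += 1;
        Some(slot)
    }

    /// The producer finished writing its slot.
    fn commit(&mut self, slot: usize) {
        self.ready[slot] = true;
    }

    /// A consumer may only take the slot at the head, and only once it is ready.
    fn try_claim(&mut self) -> Option<usize> {
        let slot = self.head % self.ready.len();
        if self.head == self.tail || !self.ready[slot] {
            return None;
        }
        self.ready[slot] = false;
        self.head += 1;
        Some(slot)
    }
}
```

If one producer reserves slot 0 and stalls while a second producer reserves and commits slot 1, `try_claim` keeps returning `None` until the first producer commits, even though a finished element is sitting in the queue.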

Lockless algorithm fundamentals

The core problem in lockless algorithms is mediating access to shared memory. SPSC queues have it easy: they can prepare work and only commit it once they’re ready. Once you allow multiple threads to compete for the ability to access the same memory, they must go through stages:

A thread must exclude other competitors accessing a chunk of memory.
The thread uses the memory (non-atomically). This stage should be as fast as possible and is typically just a memcpy.
The thread commits (publishes) its change.

Producers reserve memory to publish to consumers while consumers claim memory to be read and then released back to publishers.

Lockless bags as a new approach

We’ve established that queue-based lockless channels pay the cost of linearizability without being able to take advantage of it. We’ve also seen that the only true requirement for a lockless channel is the ability to lock a region of memory.

Instead of a queue, let’s use a bag! What’s a bag? Well, uhhhh… It’s a bag. You can put stuff in, rummage around, and take stuff out. Notice that I said nothing about what you get out—if it’s in the bag, it’s a valid item to be taken out at any time (i.e. in an unspecified order).

The fastest single-threaded bag implementation is of course a stack. But this is the multithreaded world, so let’s instead use an array and two bitvectors. The first bitvector will be our reservations: producers atomically set bits to gain exclusive access to the corresponding array slot. The second bitvector is the list of committed slots: once producers are done with a slot, they set its corresponding commit bit. Conversely, consumers unset commit bits to read the slot and unset the reservation bit to return the slot to producers.

Threads in a lockless bag independently control their slots

In this scheme, every producer and consumer operates independently. If a thread is stuck between stages, it has no effect on the progress of other threads. We have ourselves a wait-free MPMC channel!
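To see the independence property in isolation, here is a single-threaded model of just the two bitvectors, with no data and no atomics. `BagModel` and its method names are hypothetical, and always picking the lowest eligible bit is an arbitrary choice made for the sketch.

```rust
#[derive(Default)]
struct BagModel {
    reserved: u64,  // bit i set: slot i is owned by a producer or consumer
    committed: u64, // bit i set: slot i holds a published value
}

impl BagModel {
    /// Producer: take exclusive ownership of any free slot.
    fn reserve(&mut self) -> Option<u32> {
        let free = !self.reserved;
        if free == 0 {
            return None; // bag is full
        }
        let slot = free.trailing_zeros();
        self.reserved |= 1u64 << slot;
        Some(slot)
    }

    /// Producer: publish a slot once its value is written.
    fn commit(&mut self, slot: u32) {
        self.committed |= 1u64 << slot;
    }

    /// Consumer: take any published slot by unsetting its commit bit.
    fn claim(&mut self) -> Option<u32> {
        if self.committed == 0 {
            return None; // nothing is ready
        }
        let slot = self.committed.trailing_zeros();
        self.committed &= !(1u64 << slot);
        Some(slot)
    }

    /// Consumer: return the slot to producers once it has been read.
    fn release(&mut self, slot: u32) {
        self.reserved &= !(1u64 << slot);
    }
}
```

If a producer reserves slot 0 and stalls, a second producer can still reserve and commit slot 1, and `claim` hands slot 1 to a consumer immediately. Nothing waits on the stalled slot.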

You don’t need unbounded channels

Limitless anything doesn’t exist in the real world, as much as we love to pretend it does. Unbounded channels introduce a lot of complexity in an attempt to paper over poorly engineered systems. If consumers cannot keep up, producers must slow down. The best way to go about this is to apply backpressure, but sadly this is rarely an option. Dropping the messages is a possibility, though a distasteful one. Alternatively, producers can cheat and buffer messages locally while waiting for space to free up when consumers are full. This last approach is the one taken throughout Lockness, minimizing communication and therefore contention when it is most critical (the channel is overloaded).

For this reason (and, let’s be real, mostly because unbounded channels are hard), the lockless bags I’ve implemented are unconfigurably bounded.
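The local buffering strategy can be sketched on top of any bounded channel. The example below uses the standard library's `sync_channel` as a stand-in for a lockless bag; `BufferedProducer` is a hypothetical wrapper and not how Lockness actually implements it.

```rust
use std::collections::VecDeque;
use std::sync::mpsc::{SyncSender, TrySendError};

/// A producer that never blocks: values the channel cannot accept yet are
/// parked in a local backlog and retried on the next send or flush.
struct BufferedProducer<T> {
    tx: SyncSender<T>,
    backlog: VecDeque<T>,
}

impl<T> BufferedProducer<T> {
    fn new(tx: SyncSender<T>) -> Self {
        BufferedProducer { tx, backlog: VecDeque::new() }
    }

    /// Queues `value` behind any backlog so per-producer order is preserved.
    /// Returns false once the consumer has disconnected.
    fn send(&mut self, value: T) -> bool {
        self.backlog.push_back(value);
        self.flush()
    }

    /// Moves as much of the backlog into the channel as currently fits.
    fn flush(&mut self) -> bool {
        while let Some(value) = self.backlog.pop_front() {
            match self.tx.try_send(value) {
                Ok(()) => {}
                Err(TrySendError::Full(value)) => {
                    // No room: keep the value locally and stop touching the channel.
                    self.backlog.push_front(value);
                    return true;
                }
                Err(TrySendError::Disconnected(_)) => return false,
            }
        }
        true
    }
}
```

While the channel is full, each `send` costs one failed `try_send` at most, so an overloaded channel sees less traffic rather than more.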

But here’s an idea to support unbounded lockless bags

Make it a tree! Right now, the bitset represents data slots in an array, but it could additionally allow pointers to sub-nodes. This feels like a reasonable approach, though I haven’t thought too hard about a precise implementation.

Introducing Lockness Bags

Lockness Bags implement the ideas described above and might be used to build the Lockness Executor should benchmarking show an advantage.

Are Lockness bags fast?

No :(

Queue-based channel implementations win over bags on current hardware by keeping their producer and consumer memory writes disjoint. Remember the consumed and committed pointers from the diagram? In practice, these are implemented in a distributed fashion: each slot has a flag that marks it as readable or writable. To start, all slots are writable and transition to readable and back as you write and read the slots. The head pointer can only advance if its current slot is readable and conversely for the tail pointer. Crucially, this means consuming a slot almost never touches a cache line producers are actively working with.

On the other hand, lockless bags are implemented with two atomics, each representing a bitvector. To produce and consume a value, you must always update both bitvectors. This means producers and consumers are all contending over the same two cache lines. By contrast, lockless queues limit contention for producers to the tail cache line, and similarly consumers only contend on the head cache line.

The future: hardware accelerating lockless bags

The next article in the series explores the idea of novel instructions that would hardware accelerate lockless bags to significantly outperform all possible software channel implementations.

Appendix: alternative approaches and their pitfalls

This problem has been itching the back of my brain for close to five years now. As part of the journey, I’ve toyed with many different approaches that were rejected.

Why doesn’t a stack work for MPSC channels?

Instead of a circular buffer, couldn’t we use a stack? Producers would compete to place items at the top and the single consumer takes items down. Unfortunately, the consumer would need to block producers from raising the stack, otherwise the consumer could end up in a situation where it is trying to take down an already stacked-upon element. You can hack around this, but it doesn’t seem better than a queue.

Why not use many SPSC queues to make a MPSC channel?

Generalizing the question, why not use multiple stricter channels to build a weaker one? On the surface, this appears to be a straightforward solution, but you run into two problems:

Load balancing: it is difficult to share resources. For example, consider using many SPMC channels to make a MPMC channel. If one producer has a spike in load while the others remain quiet, there is no way to use the available capacity of the other producers’ channels.
Poor scaling with high core counts: either producing or consuming values must scale linearly with the number of threads to scan across the individual queues. That said, this can be worked around by developing affinities, e.g. a consumer can keep reading from the same producer if it always has values. But if your load is so well-balanced that consumers could just pair with producers, you may as well do that instead.

Additionally, orchestrating the addition/removal of individual queues in the channel and supporting sleeping becomes difficult.

Why are tunnel channels bad?

Tunnels are the simplest MPMC channel: they hold no values and thus require a pairing between producer and consumer to transfer a value onto the consumer’s stack. Consequently, either the producer or consumer must sleep to accept the next value, every time. This is painfully slow.

Why not store machine-word sized elements in the channel?

Instead of supporting arbitrarily sized values in the channel, what if we only accepted values that could fit in an atomic? More specifically pointers? Surprisingly, this doesn’t really help. If your only state is the array of atomic pointers, there’s no easy way to find free/filled slots. Thus, you need to go back to a circular buffer which has the same contention problems when the head/tail are updated but the slot hasn’t been atomically swapped to its new value. An alternative could be to scan the array for empty/filled slots until one is found, but under contention you’ll be fighting over the same slots.

Generalizing over mutability in Rust

2025-07-30T00:00:00+00:00

You’ve seen us generalize over buffer types, now let’s generalize over buffer mutability!

I tried to write something like this in a project I’m working on:

fn process_items<T>(buf: &<mut> [T], mut f: impl FnMut(&<mut> T)) {
    loop {
        ...
        f(&<mut> buf[i]);
    }
}

process_items abstracts over the complexity of retrieving items from buf and needs to support processing items both mutably and immutably. Unfortunately, this isn’t possible in Rust as far as I’m aware. I first tried solving the problem with a macro, but couldn’t figure out how to generalize over &buf[i] in one macro invocation and &mut buf[i] in another.

Thus, it is with great (dis?)pleasure that I present to you: trait magic! Feast your eyes:

Edit: a reader emailed me with a better solution.

use std::marker::PhantomData;

trait Buf<T> {
    type F;

    fn do_(&mut self, f: &mut Self::F, i: usize) -> bool;
}

impl<T, F: FnMut(&T)> Buf<T> for (&[T], PhantomData<F>) {
    type F = F;

    fn do_(&mut self, f: &mut F, i: usize) -> bool {
        let (buf, _) = self;
        if i < buf.len() {
            f(&buf[i]);
            true
        } else {
            false
        }
    }
}

impl<T, F: FnMut(&mut T)> Buf<T> for (&mut [T], PhantomData<F>) {
    type F = F;

    fn do_(&mut self, f: &mut F, i: usize) -> bool {
        let (buf, _) = self;
        if i < buf.len() {
            f(&mut buf[i]);
            true
        } else {
            false
        }
    }
}

fn process_items<T, B: Buf<T>>(mut buf: B, mut f: B::F) {
    let mut i = 0;
    loop {
        if !buf.do_(&mut f, i) {
            break;
        }
        i += 1;
    }
}

fn main() {
    let foo = ["a", "b", "c"];
    process_items((foo.as_slice(), PhantomData), |item| println!("{item}"));
    println!();

    let mut bar = ["a".to_string(), "b".to_string(), "c".to_string()];
    process_items((bar.as_mut_slice(), PhantomData), |item| {
        *item = format!("{item}{item}")
    });
    println!("{bar:?}");
}

Here’s a playground link for your convenience.

The example is contrived, but I hope it gets the trick across. If anybody knows of a simpler solution, I’d be very happy to hear it. :)

Appendix: reader emailed suggestion

Courtesy of Moritz Borcherding, here is a much nicer solution without the associated type or PhantomData:

trait Buf<T, F> {
    fn do_(&mut self, f: &mut F, i: usize) -> bool;
}

impl<T, F: FnMut(&T)> Buf<T, F> for &[T] {
    fn do_(&mut self, f: &mut F, i: usize) -> bool {
        if i < self.len() {
            f(&self[i]);
            true
        } else {
            false
        }
    }
}

impl<T, F: FnMut(&mut T)> Buf<T, F> for &mut [T] {
    fn do_(&mut self, f: &mut F, i: usize) -> bool {
        if i < self.len() {
            f(&mut self[i]);
            true
        } else {
            false
        }
    }
}

fn process_items<T, F, B: Buf<T, F>>(mut buf: B, mut f: F) {
    let mut i = 0;
    loop {
        if !buf.do_(&mut f, i) {
            break;
        }
        i += 1;
    }
}

fn main() {
    let foo = ["a", "b", "c"];
    process_items(foo.as_slice(), |item| println!("{item}"));
    println!();

    let mut bar = ["a".to_string(), "b".to_string(), "c".to_string()];
    process_items(bar.as_mut_slice(), |item| {
        *item = format!("{item}{item}")
    });
    println!("{bar:?}");
}