Friday, June 12, 2015

The difficulty of lock-free programming: a bug in a lock-free stack

At $WORK we have some custom implementations of lock-free data structures, similar to (but not the same as) boost::lockfree.  The other day I fixed a bug in our lock-free stack, which was somewhat interesting.

The symptoms we were seeing were that items were both on the stack (indicating they were free) and also were being manipulated by a thread (indicating they were in use).  In addition, walking the list of items in the stack showed a cycle, which should never happen with well-ordered code.

We spent several days looking at the surrounding code: could we have disposed of an object twice, thus corrupting the stack? Could the lock-free queue that manages items which are in use but under resource contention have a bug (we'd already fixed one there the previous week)? In the end we found the bug was in the stack code itself.

The lock-free stack at $WORK uses an intrusive list.  Like most lock-free stack implementations, ours does not allow you to delete items popped from the stack, because a concurrent pop needs to be able to look at a stale item.  Boost solves this by owning the stack items and copying/moving your data in.  We did it by leaving a comment in the code and naming the data structure atomic_recycling_stack, implying that you have to recycle the items instead of freeing them.  Our basic template setup looks like this:

// 'next' is the intrusive link member inside any_t.
template<typename any_t, any_t * any_t::*next>
struct atomic_recycling_stack
{

    void push(any_t * elem);
    any_t * pop();

    // Head pointer plus ABA-guard nonce; the pair is updated together
    // with a 16-byte compare-and-swap.
    struct atomic_item {
        any_t * head;
        uintptr_t nonce;
    } item;
};
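
To make the template parameters concrete, here's a hypothetical use (buffer and free_next are made-up names for illustration); the second template argument is the pointer-to-member naming the intrusive link inside your own type:

// Hypothetical item type carrying its own intrusive next link.
struct buffer {
    buffer * free_next;   // threaded through by the stack while free
    char data[4096];
};

// A recycling stack of free buffers, linked through buffer::free_next.
atomic_recycling_stack<buffer, &buffer::free_next> free_buffers;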


Like the boost implementation (and any other correct implementation), we pair the head pointer of the stack with an integral nonce, to guard against ABA issues on pop. Unlike the boost implementation, we have some hand-rolled atomics, for historical reasons: the code base predates C++11 and std::atomic.  Thus the atomic operations on 16-byte objects use gcc builtins, based around the x86 cmpxchg16b instruction.  Also note that, as far as I am able to determine, there is no x86 instruction to atomically read 16 bytes; see e.g. this Stack Overflow thread.  This turns out to be both an advantage and a disadvantage.

On the minus side, any read of the 16-byte pointer/nonce pair can be torn; that is, another thread may update the pointer and the nonce between our reads of the two fields, so the pair is not read atomically.  On the plus side, two 64-bit load instructions are a lot faster than a read followed by a cmpxchg16b to guarantee the caller an atomic read.  So we've traded some correctness for performance.  This is usually manageable.

The typical implementations look like this.  Here atomic_try_update_unsafe is a home-made function that reads the old value (possibly a torn read), calls the lambda, and, if the lambda returns true, attempts a compare-and-swap; if the lambda returns false, it skips the compare-and-swap and just returns.  There's a simple and obvious implementation of this using C++11 atomics: an atomic load followed by a compare_exchange_weak loop.  But remember, we do this by hand, since the code was first written before C++11.  For reference, this is roughly what the try-update function would look like if it used std::atomic:

template<typename any_t, typename lambda_t>
bool
atomic_try_update_unsafe(
    std::atomic<any_t> * item,
    lambda_t func)
{

    any_t old = item->load();
    any_t newer;
    do {
        newer = old;
        if (!func(&newer))
            return false;
        // compare_exchange_weak reloads 'old' on failure, so we retry
        // with the freshly observed value.
    } while (!item->compare_exchange_weak(old, newer));
    return true;
}

You could presumably optimize the above slightly by using acquire/release memory orderings on the load and compare_exchange_weak instead of the defaults.  On x86 it doesn't really matter.  In our case, for the 16-byte version, where there is no atomic load, we do two 64-bit reads in sequence instead.
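
For completeness, here is a rough sketch of what the hand-rolled 16-byte variant might look like using the gcc __sync builtins.  This is my reconstruction for illustration, not the actual $WORK code: two plain 64-bit loads for the (possibly torn) old value, then a cmpxchg16b to install the new one.

#include <stdint.h>
#include <string.h>

// Sketch of a hand-rolled 16-byte try-update (compile with gcc -mcx16).
// item_t is assumed to be a trivially-copyable 16-byte pointer/nonce
// pair like atomic_item above.
template<typename item_t, typename lambda_t>
bool
atomic_try_update_unsafe(item_t * item, lambda_t func)
{

    volatile __int128 * raw = reinterpret_cast<volatile __int128 *>(item);
    for (;;) {
        item_t old;
        // Two separate 64-bit loads; the pair may be torn if another
        // thread's cmpxchg16b lands between them.
        reinterpret_cast<uint64_t *>(&old)[0] =
            reinterpret_cast<volatile uint64_t *>(item)[0];
        reinterpret_cast<uint64_t *>(&old)[1] =
            reinterpret_cast<volatile uint64_t *>(item)[1];

        item_t newer = old;
        if (!func(&newer))
            return false;

        // cmpxchg16b succeeds only if all 16 bytes still match 'old';
        // a torn read normally just fails here and we loop around.
        __int128 expected, desired;
        memcpy(&expected, &old, sizeof(expected));
        memcpy(&desired, &newer, sizeof(desired));
        if (__sync_bool_compare_and_swap(raw, expected, desired))
            return true;
    }
}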

As a side note, we take the lambda as a template parameter throughout our source code because std::function is pretty messy and will heap-allocate far too often.  Hiding the type of the lambda in a template parameter gets around the issue nicely.  So with the above definition in mind, the standard lock-free stack looks something like:

void push(any_t * elem)
{

    atomic_try_update_unsafe(&item,
        [elem](atomic_item * ref_v) -> bool
        {
            // Link the new element in ahead of the current head.
            elem->*next = ref_v->head;
            ref_v->head = elem;
            return true;
        });
}


any_t * pop()
{

    any_t * oldhead;
    atomic_try_update_unsafe(&item,
        [&oldhead](atomic_item * ref_v) -> bool
        {
            oldhead = ref_v->head;
            if (!oldhead) {
                // Empty stack; skip the compare-and-swap.
                return false;
            }
            // Unlink the head; bump the nonce to guard against ABA.
            ref_v->head = oldhead->*next;
            ref_v->nonce++;
            return true;
        });
    return oldhead;
}


So far this is pretty much the same as the boost implementation, modulo details like the fact that the boost lock-free stack owns the containing objects and their next pointers.  The update of the nonce on pop protects against the ABA problem.  So this is correct code, and it looks very similar to what boost does.

Except in our case it's not correct.  Remember that, unlike C++11 atomics, the old value passed into the lambda may be torn; hence the _unsafe suffix on atomic_try_update_unsafe.  Our source code has a sibling function, atomic_try_update, that is guaranteed never to see torn reads; it trades some performance for that guarantee.  In general, if your code can tolerate seeing a torn value in the lambda, you're better off using the unsafe version.  Typically a torn read is not an issue, because the subsequent cmpxchg16b is atomic: if the old value happened to be torn, we usually just bounce on the CAS and try again.
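
As an aside, here's the classic trick for an atomic 16-byte read on x86, which is roughly what the safe variant has to pay for (a sketch of the general technique, not necessarily how our atomic_try_update is written):

// Sketch: an atomic 16-byte load built from cmpxchg16b (gcc builtin,
// compile with -mcx16).  A compare-and-swap whose expected and new
// values are equal never changes memory, but it always returns the
// current 16-byte contents atomically.
static inline __int128
atomic_load_16(volatile __int128 * p)
{
    return __sync_val_compare_and_swap(p, (__int128) 0, (__int128) 0);
}

Note that even when the compare fails, the locked cmpxchg16b takes the cache line exclusive; that is exactly the cost the two plain 64-bit loads avoid.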

Take a moment to see if you can figure out why the push/pop code above is correct for boost (no torn reads) and not for us.

Here's the sequence of events that can cause a failure, on a stack holding A -> B -> C with nonce=17.  (X below stands for some unrelated value that Thread 1, which owns A after popping it, happens to write into A's next field.)

Thread 1            Thread 2
------------        ------------
read A/17 (for pop)
prepare B/18
                    read head=A (for pop)
cmpxchg B/18 in
                    read nonce=18 (torn read!)
set A->next = X
                    prepare X/19
push A:
read B/18
A->next = B
cmpxchg A/18 in
                    cmpxchg X/19 in

Thread 2's final compare-and-swap succeeds, because the buggy push left the pair at A/18, matching Thread 2's torn read.  Now we have some unrelated item X at the head of our stack.

The fix is simple enough: update the nonce on push as well as on pop.  Remember, this change is only needed because of the possibility of torn reads.  Using 16-byte atomic loads would fix the issue a different way, at the cost of worse performance.
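
In code, the fixed push is a one-line change:

void push(any_t * elem)
{

    atomic_try_update_unsafe(&item,
        [elem](atomic_item * ref_v) -> bool
        {
            elem->*next = ref_v->head;
            ref_v->head = elem;
            ref_v->nonce++;  // the fix: bump the nonce on push too
            return true;
        });
}

With this change the push in the trace above installs A/19 rather than A/18, so Thread 2's final cmpxchg fails and it retries with a fresh read, as it should.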

And this is my day job. It's not always like this, but just this week we've fixed this bug as well as one in our lock-free skiplist implementation.  If this sounds interesting or exciting to you, feel free to send me a resume; we're hiring now and we have lots of open headcount.  Either way, happy hacking!

(Note: all this code is from my head, so please excuse any small errors.)