Tuesday, October 27, 2015

Hash Consing with Modern Processors, Your In-Cache Cache

If you have
a '*' c == b '*' c
(where '*' is multiplication modulo a prime and c is nonzero), then you know that a must be equal to b.
So problem solved, right? Well... You carry this messy prime number everywhere. It makes everything more complicated. For example, you cannot work with all 32-bit numbers: with the prime 0x7FFFFFFF as the modulus, both 0x7FFFFFFF and 0 are zero, 0x80000000 == 1, and so forth.
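To make the wrap-around concrete, here is a tiny sketch (mine, not from Lemire's post) of reduction by the Mersenne prime 2^31 - 1:

```cpp
#include <cstdint>

// With the Mersenne prime p = 2^31 - 1 (0x7FFFFFFF) as the modulus,
// distinct 32-bit inputs collapse: 0x7FFFFFFF and 0 both reduce to 0,
// and 0x80000000 reduces to 1.
uint32_t mod_p(uint64_t x) {
    const uint64_t p = 0x7FFFFFFFull;  // the prime 2^31 - 1
    return (uint32_t)(x % p);
}
```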
What if there were prime numbers that are powers of two? There is no such thing... when using regular arithmetic... But there are "prime numbers" (called "irreducible polynomials" by mathematicians) that act a bit like they are powers of two when using carry-less multiplications.
With carry-less multiplications, it is possible to define a modulo operation such that
modulo(a "*" c) == modulo(b "*" c)
implies a == b. And it works with all 32-bit or 64-bit integers.
That's very nice, isn't it?
Up until recently, however, this was mostly useful for cryptographers because computing the carry-less multiplications was slow and difficult.
However, something happened in the last few years. All major CPU vendors have introduced fast carry-less multiplications in their hardware (Intel, AMD, ARM, POWER). What is more, the latest Intel processors (Haswell and better) have a carry-less multiplication that is basically as fast as a regular multiplication, except maybe for a higher latency.

That's right: CLHash is much faster than competing alternatives as soon as your strings are a bit large. It can hash 8 input bytes per CPU cycle. You are more likely to run out of memory bandwidth than to wait for this hash function to complete. As far as we can tell, it might be the fastest 64-bit universal hash family on recent Intel processors, by quite a margin.

This great blog post by Daniel Lemire inspired me to put together something about the idea of hash consing in purely functional data structures. It is an optimization that doesn't get enough attention in literature designed for non-experts.

The idea behind hash consing is that in a system with high data throughput (which describes most functional languages), one will frequently re-create substructures that are identical to live or recently-freed substructures. For example, one frequently ends up calling two different helper-library functions that each need to reverse a list. For a large linked list, the duplicated results can occupy a lot of space.

In a functional language, a program's control flow is usually dictated by the traversal of a data structure. Computing the first n powers of a given prime, for example, is equivalent to traversing a list of powers of length n. In a lazy language, a reference to the (n-1)-th value would be a reference to a "thunk", or unevaluated closure, that when called will produce the value. Each thunk need only be evaluated once; but an arbitrary number of copies of the (n-1)-th power can be computed by making multiple lists with the same base prime. We duplicate work unnecessarily by evaluating all of these thunks.

There is an entire class of algorithmic and compiler micro-optimizations to avoid this overhead. A fairly general and simple method is to make the allocator de-duplicating. In a pure language, the result of an allocation will not be mutated. Therefore two requests to create the same substructure can be given the same reference without any alteration of program semantics or control flow. This is done by hashing the structure to be created and checking if it has already been created. If so, the old reference is returned. This creates a pervasive cache throughout every level of the application.
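As a sketch of the idea (the names and representation are mine, not from any real allocator), a de-duplicating "cons" for immutable list cells might look like this:

```cpp
#include <map>
#include <utility>

// Hash consing in miniature: identical (head, tail) requests get the same
// cell back, so structural equality collapses to pointer equality. This
// sketch uses an ordered map and never frees cells; a real allocator would
// hash the structure and purge dead entries via weak references.
struct Cell {
    int head;
    const Cell* tail;
};

const Cell* cons(int head, const Cell* tail) {
    static std::map<std::pair<int, const Cell*>, const Cell*> interned;
    auto key = std::make_pair(head, tail);
    auto it = interned.find(key);
    if (it != interned.end())
        return it->second;            // same substructure already exists
    const Cell* fresh = new Cell{head, tail};
    interned.emplace(key, fresh);
    return fresh;
}
```

Two calls that build the list (1, 2) end up sharing every cell, so comparing the lists is a single pointer comparison.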

While simple to explain, this optimization has a lot of nuances. We need to free structures at some point. Let's consider the simple case where we don't keep any substructures that are unreachable from the program: the table is only a de-duplication of live objects. When the last reference to a substructure goes out of scope, the cache must be updated. This is typically implemented with "weak references" that the garbage collector is allowed to null out.

We must choose to either place a finalizer on every object (expensive) or allow for false positives in the cache. False positives are relatively cheap when encountered during a query, since we'd have needed to find the slot to fill with the reference either way. False positives do require a "garbage collection" step for the cache, though, where the weak references that have become null are removed from the structure.
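A minimal sketch of the weak-reference variant, using C++'s std::weak_ptr (the integer key and Node type are placeholders for real structural hashing):

```cpp
#include <map>
#include <memory>

// An intern table of weak references: entries never keep nodes alive, so a
// dead entry is exactly the "false positive" described above. We discover
// it while querying, at the very slot we would have filled anyway.
struct Node { int value; };

std::shared_ptr<Node> intern(int value) {
    static std::map<int, std::weak_ptr<Node>> cache;
    std::weak_ptr<Node>& slot = cache[value];
    if (std::shared_ptr<Node> live = slot.lock())
        return live;                  // hit: the node is still alive
    auto fresh = std::make_shared<Node>(Node{value});
    slot = fresh;                     // fill (or refill) the stale slot
    return fresh;
}
```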

The classic reason to avoid hash consing is that it has historically been expensive to compute a hash that is unlikely to collide. The above article brought to mind the fact that modern techniques have made hashing much cheaper. Pătrașcu and Thorup have shown that simple tabulation hashing performs well enough to be game-changing. Alongside hardware primitives, I believe that the cost of hashing has become low enough to make revisiting techniques such as hash consing worthwhile. In the worst case, it adds a constant overhead per allocation and provides no gain. In the best case, it performs optimal deduplication of both computation and intermediate data structures.

Hash consing has interesting cache trade-offs. It can keep a data structure small, which improves overall cache behavior due to lower memory pressure. On the other hand, objects allocated at the same time may not be near each other now. An interesting idea in the vein of Demaine's Tango Trees would be to move the deduplicated structure to the most recent allocation site, and move it back on free. For deep structures this has high cost, so heuristics are necessary. Alternatively, the technique could be used to only return objects that are likely to be in the cache. A cache-oblivious hash consing technique seems like an easy win, preventing allocation from dirtying cache lines where possible.

Thursday, October 22, 2015

Optimal Layout of Static Arrays for Search

A recent publication has some surprising results. Elements are commonly stored in sorted order to allow for efficient binary search. An analysis of the different layouts (B-tree, sorted array, van Emde Boas, etc) has found that the Eytzinger layout (invented in 1590!) is actually the optimal layout.
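A minimal sketch of searching the Eytzinger layout (my own toy version using GCC/Clang builtins, not the paper's tuned, prefetching code):

```cpp
#include <cstddef>
#include <vector>

// Lower-bound search over the Eytzinger (BFS) layout: slot 0 is unused and
// node i has children at 2i and 2i+1. Returns the layout index of the first
// element >= key, or 0 if every element is smaller.
std::size_t eytzinger_search(const std::vector<int>& a, int key) {
    std::size_t i = 1;
    while (i < a.size())
        i = 2 * i + (a[i] < key);   // branch-free descent
    // The trailing 1-bits of i record the final "go right" steps taken past
    // the answer; shifting them out (plus one 0 bit) recovers the answer.
    return i >> __builtin_ffsll(~(long long)i);
}
```

For the sorted values {1, 2, 3} the layout is {unused, 2, 1, 3}: the root holds the median and the leaves pack together, which is what makes the access pattern so cache- and prefetch-friendly.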


History of Memory Models

This is the last subject from MIT's advanced data structure course that I'll be covering. The history of memory models section covers the different models of execution that algorithm analyses have used, and the unique bounds that richer models offer. Cache obliviousness will play a key role in my future topics of research for this study.

Dynamic Graph Connectivity

The dynamic graph portion of MIT's advanced data structure course is focused on the problem of maintaining a structure that allows for querying connectivity between any two nodes. The "dynamic" part of the title stems from the problem of dynamic optimality that is addressed earlier in the course. Querying the structure causes the structure to rebalance itself such that subsequent queries will be faster. It's like an in-structure cache.

Succinct Suffix Arrays and Trees

The second part of the succinct section follows the previous succinct/string section. This focuses on the problem of succinctly representing suffix arrays, an important optimization to keep memory pressure down for string algorithms.

Vertical Finger Trees

A finger tree is an elegant persistent data structure which supports O(1) insertion and O(log(n)) query operations for sequence-like data structures. By using the structure with different annotations, one can represent random access, priority queues, interval trees, and key/value stores in finger trees. While finger trees aren't always the fastest way to build any of these structures, their versatility makes them indispensable.

Finger trees unfortunately have access patterns that modern hardware does a poor job of optimizing. The structure consists of many tiny segments of memory containing many pointers, and pointer chasing throughout the structure can cause terrible cache behavior. I am going to use tools from succinct and cache-oblivious algorithms to create a finger tree variant with better performance on modern hardware. I've termed my variant a "vertical" finger tree, in contrast to the current "horizontal" finger tree storage style.

Horizontal Finger Trees

Accessing the leftmost and rightmost leaf elements of a conventional tree will take O(log(n)) time. Looking at the first diagram below, we can see that this is the case. If you access the elements at the ends very frequently, this can become unacceptable. This is one reason why stacks, queues, and ordered sequences are very rarely represented by trees.

What if we took this tree, and yanked the ends upwards so that the root of the tree can reference the subtrees at either end in O(1) time? We can do this for the siblings of the subtrees as well.

What you get is a linked list of finger tree "levels," each one referring to a tree of fixed depth. As you move down the tree the depth of the tree increases by 1. At the bottom, we have an empty tree. This empty node is what our tree root used to be. In this way we turn our tree "inside out."

This has some surprising power. The increasing depth allows the rebalancing to be amortized, with the rebalancing charged to the insertions. In the end, appending to either end of the tree is O(1) amortized.

Finger trees also store annotations throughout the tree. There is an interface in functional programming called the "monoid" interface. Annotations are monoids. A monoid can be thought of as a class of entities that have an associative binary operation that combines them, and an identity element for this operation. Finger tree annotations are not simply monoids, they must be monotonically increasing. If I combine two annotations, the result must be at least as large as both arguments, if not larger.

An example is the integers, with 0 as the identity and + as the operation. Another example is lists, with the empty list as the identity and list appending as the operation. When a node is made that refers to subtrees, its annotation is the accumulation of all of the subtree annotations with the monoidal operation. Finger tree levels store the accumulation of all lower levels. Since these are made at node creation time, maintaining annotations adds a constant overhead to the O(1) node-creation operation.
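As a sketch, here is the size annotation as a monoid, cached at node-creation time as described above (the types and names are illustrative, not from any finger tree library):

```cpp
#include <cstddef>
#include <initializer_list>

// The "size" annotation as a monoid: identity 0, operation +.
struct SizeMonoid {
    static std::size_t identity() { return 0; }
    static std::size_t combine(std::size_t a, std::size_t b) { return a + b; }
};

struct Node3 {
    const Node3* child[3];   // up to three children; null = absent
    std::size_t annotation;  // cached monoid sum over the children
};

// A leaf carries the measure of a single element.
Node3 make_leaf() { return Node3{{nullptr, nullptr, nullptr}, 1}; }

// The annotation is computed once, when the node is created, so the
// bookkeeping is a constant overhead on node creation.
Node3 make_node3(const Node3* a, const Node3* b, const Node3* c) {
    std::size_t m = SizeMonoid::identity();
    for (const Node3* ch : {a, b, c})
        if (ch) m = SizeMonoid::combine(m, ch->annotation);
    return Node3{{a, b, c}, m};
}
```

Swapping SizeMonoid for "max of subtree" or "concatenated keys" is what turns the same tree into a priority queue or ordered sequence later on.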

Finger trees expose an operation, called split, that is the source of their power. Split takes a predicate and finds the smallest prefix of the sequence in the finger tree whose annotation sum makes the predicate true. This point is unique because the sum is monotonically increasing.

Split starts at the head of the finger level list. It checks whether the left affix's sum is large enough to make the predicate true. If so, it enters the affix, finds out which subtree causes the change, and recurses. This is O(log(affix size)). If the left affix isn't sufficient, it can add the left affix's sum to the accumulated sum to find out whether the predicate change happens in a lower level. If it doesn't, we can skip right to the right affix. This takes worst-case O(log(n)) time; see the structure's paper for the proof.

This is the trick that takes finger trees from "that's interesting" to "this is the building block I'll use for all sequences from here on out." Haskell actually uses a finger tree for its sequence data structure (Data.Sequence), a workhorse of the language. If we make the annotation the size of the subtree, we get a random access structure. If we make the annotation the largest element in the subtree, we get a priority queue. The paper also describes ordered sequences and interval trees. In short, finger trees can represent many data structures. Most interestingly, one can pick an annotation that embeds multiple annotations to have a structure that can behave like multiple abstract data types.

For those looking to understand the conventional finger tree structure better, the paper which presents the structure is "Finger Trees: A Simple General-Purpose Data Structure" by Ralf Hinze and Ross Paterson.


I've found that http://andrew.gibiansky.com/blog/haskell/finger-trees/ is the most approachable introduction to the structure. That post is the source of the above diagrams.

Minimizing Pointer Chasing

I believe that the most important change to make is to decouple the fingers of the finger tree. One tends to access a single side of the tree far more often than both sides (appending to a singleton is the only real exception), so it makes sense to represent a finger tree as two arrays of references to affixes. The left affixes have their references stored in one array, while the right affixes are in another. I'll refer to these as the finger tree's vertical spines.

A conventional finger tree's "finger-level" storage is incredibly cache-unfriendly. It contains many references that must be chased. Consider the important task of traversing the tree from left to right. One visits each of the left affixes in essentially a linear scan down the left side of the finger levels, and then travels back up the right affixes. Moving from one affix to the next takes two pointer dereferences to segments of memory allocated at different times, and possibly very far apart.

Furthermore, I believe that the layout of fingers assumes unrealistically large structures. If most appends don't overflow, then it becomes unlikely that an append would require updating more than one finger. This is a good justification for the current storage, where each level refers to the next element. But is avoiding having to "invalidate" persistent substructures by fixing pointers worth the overhead?

How deep would the tree have to be for this to be worthwhile? How deep is a finger tree likely to become?

We know that each level's affix holds references of an increasing depth, and that each finger level's affixes can hold 8 elements (a 4-affix on each side). So the number of elements a tree of depth N can hold is equal to the sum of i from 0 to N of (8 * (3 ** i)).

For a tree to require 5 levels, it would have to hold 2,912 elements. For a tree to require 8 levels, it would have to have 78,728 items. Since recreating a finger would require allocating at least 3 words of memory, we get the entire spine with a minimal overhead.

At some point it makes sense to break the spines up into arraylets, but the gains that we get from cache friendliness should swamp the cost of a bit more memory pressure. Is this safe? Could we end up with too many references? What's the largest depth on a 64-bit machine?

If the sum from 0 to N of (8 * (3 ** i)) is equal to 2 ** 64, then N ≈ 38.118.

This means that for a finger tree to hold a reference to every single address of memory on a 64-bit system with full bus saturation (1.8 x 10 ** 19 bytes of memory), it would only need spines of length 39. This is an acceptable overhead.
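The arithmetic above is easy to check with a quick helper of mine, using the closed form: sum over i = 0..N of 8 * 3^i equals 4 * (3^(N+1) - 1). (This overflows 64 bits past N = 38, so the sketch only covers the small cases.)

```cpp
// Capacity of a finger tree whose spines have depth N: 4 * (3^(N+1) - 1).
unsigned long long capacity(unsigned n) {
    unsigned long long pow3 = 1;
    for (unsigned i = 0; i <= n; ++i) pow3 *= 3;   // pow3 = 3^(N+1)
    return 4 * (pow3 - 1);
}

// Smallest depth whose capacity covers a given element count.
unsigned depth_needed(unsigned long long elements) {
    unsigned n = 0;
    while (capacity(n) < elements) ++n;
    return n;
}
```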

Better Memory Usage of Affixes

Finger trees also exhibit a pathology with respect to memory management. Persistent data structures get their power from the de-duplication of substructures between versions. Finger trees discard affix levels in a very wasteful way, freeing a structure that they must remake upon the next insertion. When inserting into a finger tree, the top level's prefix and suffix affixes are entirely thrown away. While this is a minor waste, as the trees are small, it's something that happens very frequently.

It becomes an even uglier waste when one considers the steps that occur when one inserts into an affix that contains three items. We discard the affix's top node containing three elements to create one with four elements. The next time we insert, we need to create a 3-element node to push down to the next level, along with the 2-element node that the current level's affix reference will point to. But this 3-element node contains exactly the values that we just freed. This is absurd! We create, free, and recreate every single branch that makes it past the first level. This taxes the allocator much more than is necessary.

A minor optimization here that prevents a lot of this waste is to not free this 3-element node. Since every single node besides the ones directly referred to by the finger tree levels will contain 3 elements, when we want a 4-element affix, we should simply store the fourth element in the finger level and not change the 3-branch. I'll refer to this separately-stored overflow value as the "slop" value.

We must deal with the degenerate case of the single-element finger level. We will store this single value in the left overflow slot for that finger level.

Since we no longer have a finger "level" as a result of putting the affixes in spines, we will have to store these overflows in their own spines. We'll refer to these as "slop" spines. We will be able to use a space-saving optimization taken from hash array mapped tries (abbreviated HAMTs), which will be described in the next section.

This doesn't complicate the rebalancing any. In fact, it allows for a cache-oblivious rebalancing that requires much less pointer chasing, as the rebalancing section will describe.

Slop Spine Compression

Many of the fingers will not have overflows. In fact, a finger tree made by concatenation will only have overflows at the bottom (see the "nodes" function from the reference article above to see why). Since having many frequently-changing, mostly-empty arrays can lead to a lot of bloat in a purely functional persistent structure, I chose to use a popcount-based array compression technique.

A slop spine will contain at most 32 elements. If the spine exceeds this (good god, the finger tree would have to contain 22236242266222088 elements) then a linked list of arrays can be used. A spine will contain no null values.

Each spine will have an integer bitmap associated with it. This bitmap contains a 1 where there is a slop value at the index of that bit offset and a 0 where there is no value. Accessing the n-th slop value involves masking out the bits at and above the offset, counting the ones below that offset (x86 has a popcnt instruction for this, but there are other rank/select structures), and using that count as the index into the array. We've gained null-freedom at the cost of a single word. Furthermore, we can represent an empty array by simply making the reference to the array null.
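A sketch of the lookup (the struct layout and names are mine):

```cpp
#include <cstdint>

// HAMT-style compressed array: a 32-bit bitmap marks which logical slots
// hold a slop value; the array stores only those values, packed. Rank (the
// population count of the bits strictly below a slot) converts a logical
// index into a physical one.
struct SlopSpine {
    uint32_t bitmap;   // bit i set => logical slot i is occupied
    int slots[32];     // only the first popcount(bitmap) entries are used
};

bool has_slop(const SlopSpine& s, unsigned i) {
    return (s.bitmap >> i) & 1u;
}

int get_slop(const SlopSpine& s, unsigned i) {
    uint32_t below = s.bitmap & ((1u << i) - 1u);  // bits strictly below i
    return s.slots[__builtin_popcount(below)];     // x86 popcnt via builtin
}
```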

Since the spines are small anyway, is this worth the effort? We also know that if we have a run of ones at the start of the bitmask, we will overflow until the next 0. The index of the first finger that doesn't overflow can be found by flipping all bits in the bitmask and finding the first set bit (previously the first non-overflowing value, i.e. the first 0). CPUs have single instructions for this kind of bit scan (LZCNT on x86, CLZ on ARM, or their trailing-zero counterparts, depending on bit order), and compilers expose them as intrinsics. This is fast. We'll see below why knowing how many finger levels will overflow is important.
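A sketch, assuming an LSB-first bitmap (bit 0 = topmost finger level); an MSB-first layout would use a leading-zero count instead:

```cpp
#include <cstdint>

// How many levels an insertion will overflow through: the length of the
// initial run of 1-bits in the bitmap. Flipping the bits turns "first 0"
// into "first set bit", which a count-trailing-zeros finds directly.
unsigned overflow_run(uint32_t bitmap) {
    uint32_t flipped = ~bitmap;            // the first 0 becomes the first 1
    // GCC/Clang builtin; undefined for 0, i.e. a bitmap of all ones.
    return (unsigned)__builtin_ctz(flipped);
}
```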

In short, this is a worthwhile optimization.

Parallel Rebalancing

Conventional finger tree rebalancing is a process with very many sequential dependencies. A tree is checked and found to have four items, two new trees are made (one for the current level and one for the next), the next level is appended to (possibly overflowing and repeating this), and then the future version of the current level can be returned with all of the references set. On actual hardware, this runs into a more fundamental problem than poor cache behavior.

Computers have largely gotten faster due to more and more Instruction Level Parallelism on modern CPUs. Instructions that do not alter the same memory locations or registers will be run at the same time, since logically they can commute without altering program execution. When you have a long sequence of instructions in which the product of one instruction is necessary for the next, such as the one outlined above, the pipeline of instructions executing at any one time will come to a screeching halt. You'll have many CPU cycles in which large portions of your possible resources have to sit unused.

Restructuring an algorithm so that logical anti-dependencies do not have incidental data dependencies can lead to massive performance gains. Furthermore, if one wants to make sure that this parallelism is exploited, and one knows that operations will be identical across the data, one can turn to the Single Instruction Multiple Data (SIMD) operations present on modern CPUs. SIMD operations save the CPU from messy heuristics, and even free up registers for other instructions to use (SIMD instructions use a different set of registers).

If you have a chain of rebalancing that needs to be done, only the non-overflowing case will require different logic from the others. Logically what happens in the new array layout is that the value at finger level N and the slop value from finger level N+1 are made into a new 2-branch at level N+1. The top level contains the top slop and the newly-inserted value.

Since a chain of overflows requires a sequential list of slop values, we can simply use parallel arrays with an i+1 indexing the slop array and an i for the old-affix array to create the new 2-branches. These addresses can then be copied into the relevant fields of the new version of the relevant affix spine. We can do all of these operations with the commonly available x86 SIMD instructions. Since spines should be relatively small, all of the rebalancing will likely take a single SIMD loop followed by the non-overflow case and the structure finalization.
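A scalar sketch of that loop (all names invented here); with no cross-iteration dependence it vectorizes readily, or can be written with SIMD intrinsics directly:

```cpp
#include <cstddef>

// The overflow chain as parallel-array processing: for each overflowing
// level i, the new 2-branch at level i+1 pairs the old affix from level i
// with the slop value from level i+1. No iteration depends on another.
struct Branch2 {
    const void* left;
    const void* right;
};

void build_overflow_branches(const void* const* old_affixes,
                             const void* const* slops,
                             Branch2* out, std::size_t run_length) {
    for (std::size_t i = 0; i < run_length; ++i)
        out[i] = Branch2{old_affixes[i], slops[i + 1]};
}
```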

Since we need to compute the annotations for all of these new nodes referred to by affixes, and since this annotation must be a monoid and thus only dependent upon the annotations of values in the node, we can compute all of these in SIMD parallel as well.

Maintaining the bitmask structure is also surprisingly independent of the data we're generating. We first add slop values at the root, and only push them downwards if they overflow. We already know how long the chain of overflows will be, so we can mask all of the slop values that had overflowed to 0, and need only to check if the non-overflow case will yield a slop value or not.

We see that we've opened ourselves up to a world of compiler and CPU optimization. We've gone from a sequential tree modification algorithm to a highly parallel array processing one with explicit anti-dependence between steps.


Splitting

Split works by moving through the finger levels and chasing either the left affix, the next finger level, or the right affix, based on which one holds the value that causes the predicate to go from false to true. It has a really clever micro-optimization to keep access O(1). Each finger level stores the cumulative measure from the bottom. This lets us know that if the prefix's measure and the next level's measure combine to a value that does not cause the predicate to become true, then the value is in the right affix at the current level. Therefore conventional finger trees descend only the strictly necessary number of levels when traversing. The current algorithm has some problems for us, though.

Firstly, the conventional algorithm no longer applies directly, because we use a slop slot rather than chasing the annotation in the affix. Since this value comes on the left, though, we can simply perform one more comparison and treat it as a constant overhead per level. It's not free, but it's cheap, since the finger spine is relatively small compared to the trees at the lower levels. We'll focus on making those cheaper.

Once we've found the finger level and affix that contain the split point, we need to traverse our tree of 3-branches to find the point. This is where we start jumping around the cache unwisely. Conventional finger trees fetch a node and then decide whether the measure change happens there. That is, the finger tree pays the cost of a cache access before knowing whether it needed the data at all. We make the observation that in a given finger tree there will only be one reference to a 3-branch at a given time. While a persistent data structure may have many, many versions referring to the same 3-branch, this is not the common case.

We will store the measure of a 3-branch alongside its reference. With the restriction that a measure be a word, this pair should fit in a doubleword. We've suddenly added a factor of two to our 3-branch node sizes and our affix array sizes (although each 3-branch no longer has to store its own measure), but we have gained the ability to chase only the pointers that we need to.
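A sketch of the decision step (types and names are mine); note that no child pointer is dereferenced until we know which branch we actually need:

```cpp
#include <cstdint>

// A measure cached beside each pointer lets split choose a child without
// touching any child's cache line. Measures here are plain integers
// (e.g. subtree sizes).
struct Branch3;   // opaque here; never dereferenced by pick_child

struct AnnotatedRef {
    const Branch3* node;
    uint64_t measure;   // cached copy of node's monoid measure
};

// Return the index of the child whose inclusion first pushes the running
// sum past `target`, or -1 if the split point lies beyond this node.
int pick_child(const AnnotatedRef child[3], uint64_t accum, uint64_t target) {
    for (int i = 0; i < 3; ++i) {
        accum += child[i].measure;   // reads only the inline measure
        if (accum > target) return i;
    }
    return -1;
}
```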

Memory Management

Manual memory management for purely functional structures can be very taxing and inefficient because of the access patterns for persistent structures. Some substructures will have an incredibly large number of incoming references and others will live for an incredibly short time period. Finger trees are constantly criticized for having unintuitive memory management patterns and requiring a tracing garbage collector. We'll address the issues, but first let's introduce some terminology and properties.

We'll refer to a block of memory allocated at once as an "allocation cell". Memory management tends to involve reference counting, a tracing collection system, or a system that manages lifetime semantics explicitly. The latter is error-prone, being the cause of most catastrophic software failures in the last few decades. It can also lead to an unacceptable amount of explicit bookkeeping. Tracing and reference counting, both known as "garbage collection" systems, are our best bets here.

Tracing consists of finding all of the live "root references" (in this case, the live references to versions of our finger tree) and marking the transitive closure of those roots as alive. Objects not marked as alive can then be freed. In practice, most systems will use this memory to satisfy the next allocation rather than free it. When allocations have different sizes, there is a problem of fragmentation. Most collectors get around this by moving objects and fixing references. This causes a lot of memory traffic, essentially busting the cache. If a finger tree operation required a moving collection, every single one of its old in-structure cache lines is suddenly useless. These immutable objects have to be re-fetched from memory, which can be pretty wasteful.

Reference counting consists of having each allocation store the number of incoming references, and updating these counts as references are created and destroyed. The count fits in a single unsigned word-sized integer: for it to overflow, the number of live references would have to exceed what a pointer can address. If there is a reference cycle, that is, if there is a cycle in the object graph, then reference counting cannot free the memory. This is reference counting's big failure state, along with heavily-mutated objects. But finger trees don't actually have this problem. As our allocations are immutable, it is impossible to create a reference cycle: a reference must point to an earlier allocation, so references have a total ordering, much like a logical vector clock.

Reference counting is desirable because it is prompt. If a very large structure becomes garbage, the system's memory usage will not go down until the tracing collector has run and freed the memory. In a system where memory is re-used rather than freed and objects are not moved, a worst-case scenario could prevent the memory usage of the application from ever going down. This is barely acceptable for a language's garbage collector; it is not acceptable for a lightweight data structure library.

Memory management can be done by using any off-the-shelf reference counting system. An important property to prevent heavy modification of reference counts is to only hand out references to the root of the structure. The allocations done by the data structure will only have references inside the structure, keeping the amount of modification that must be done to a minimum. My reference implementation will use C++'s shared_ptr.
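A sketch of what a node type might look like under this scheme (illustrative, not the actual reference implementation): internal edges are set once at construction and never mutated, so refcount traffic is proportional to allocation rather than traversal, and only the root escapes to callers.

```cpp
#include <memory>
#include <utility>

// Immutable tree nodes managed by an off-the-shelf reference counter.
struct TreeNode {
    int value;
    std::shared_ptr<const TreeNode> left, right;
};

std::shared_ptr<const TreeNode>
make_node(int value,
          std::shared_ptr<const TreeNode> left,
          std::shared_ptr<const TreeNode> right) {
    // Edges are fixed here, at creation; nothing adjusts counts afterwards
    // except handing out or dropping the root.
    return std::make_shared<const TreeNode>(
        TreeNode{value, std::move(left), std::move(right)});
}
```

A production version would likely use intrusive counts to avoid shared_ptr's separate control block, but the ownership pattern is the same.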


Now that all structural changes have been described, I thought that diagrams might be useful.


The tree above has the first three levels as nodes with more than a single element, and the fourth level has a single element (overflowed to the left "slop" slot). The bitmask shows that the right slop array is empty and the left one has elements at indices 1 and 4. An empty slop array is written here as an allocated zero bitmask, but the root reference to it will just be null in real life. There's no need to allocate a 1-byte array to hold a null.

FullNode at depth 1 has child elements that contain the datatype that the tree holds.


You can follow my work implementing this structure here: https://github.com/alexanderkyte/VerticalFingerTree.

Actual benchmarks are forthcoming. I intend to use http://www.simplescalar.com/ as it has a history of use in analyzing cache-oblivious algorithms.

Monday, October 5, 2015

Strings and Succinct Data Structures

This last week I covered the "strings" section of the MIT lecture and the first of the two "succinct" data structure lectures. 
The strings section presented many of the problems that have motivated the succinct data structures that are covered in this course. I found the succinct data structure section very interesting. There are a few key ideas which underlie a lot of the succinct data structures. Looking briefly into some of the papers mentioned shows this to be a sub-field with a history of mental gymnastics.
I'm quite a fan of the balanced parenthesis representation, as it represents the shape of the tree without naming its interior nodes. This is interesting, as it allows for memoization and other optimizations to arise out of the exposed generality.