Thursday, December 10, 2015

Finger Trees End-Of-Term Speech

Apologies for not posting much, I've been contributing to the repository. I haven't yet finished the implementation, but I've made a few changes to what was initially described. As an end-of-term thing, I am delivering a lecture on finger trees.  My lecture notes are after the jump, linked directly below.

Modern Algorithm classes will teach you about what is fast, counting iterations in terms of big O notation. What very few hear much about is the impact of the cache and branch prediction on actual observed performance. I was surprised to hear in my independent study how branch prediction can make a well-balanced partition into a 50-50 worst case. Many graph and tree manipulation algorithms require walking around and copying references. In the execution model shown to most, it seems to be only one step to access far away memory. Adding two integers is as expensive as bursting the cache in this model. This isn't true in the real world.

The impact of how well the cache works with you on modern CPUs is so large that there is even an analysis of efficiency in terms of blocks of cache likely to be pulled into memory, not even considering CPU. You laugh, but most of the time the CPU is spent waiting on memory movement, blocking, spinning it's fingers. Optimizing an algorithm for the thing that isn't actually causing performance loss doesn't make sense. This is the cache-aware, and cache-oblivious algorithm school. It is from this school of algorithms that the B+ trees used in databases, the filesystem algorithms from OSes, and techniques like columnar databases are drawn.

There is a family of algorithms called succinct algorithms that tries to reason about efficiency in representation and in data movement. The idea is that if you store the data in a flat order, you can use bit manipulation to set and unset predicates about the structure.

A classic example is the Hash Array Mapped Trie. (Show HAMT bitmasking) A trie is like a b-tree, where the slots can be empty, a value, or a reference to a node. When two values hash to the same slot, a node is made. An efficient representation of the HAMT will have a bitmask and place a 1 if there is a non-null value in that level's slot. The data will all be in an array, skipping over nulls. Therefore you know the structure of the array, can mask and popcount to the index of the new data element, and use it like an efficient fixed-sized array level. I mention the HAMT because the bit tricks I chose to use for my finger tree data structure have a lot in common with the class of tricks used on HAMTs.

One of the coolest things about succinct and other integer programming (pointer tagging, using compressed pointers) proofs is that you frequently see people cite information theory and explicitly enumerate the things the structure must communicate. Two people using a structure are "communicating" the state change with the bits they set, and succinct data structures try to keep the actual data representation to this lower limit. I urge anybody who wants to see such a proof to refer to Doctor Demaine's video presentation of the proof of the lower bound on the predecessor problem, it's mind-blowing in parts.

But overall I think that we will see more, not less tools like this. Modern computers have been getting faster and faster at SIMD computation. SIMD the same thing to every batch of bits. Succinct data structures allow for massive structural changes in parallel, keeping the work on the registers. The cost of non-nullity should not be a pointer, or a null word, but a bit.

I have chosen the finger tree as a case study in applying the techniques of cache-oblivious/succinct optimizations to an existing structure.

A finger tree is an elegant purely functional data structure that models sequences. It can be used to model priority queues, range trees, and O(log(n)) random-access arrays.

The finger tree has a backbone that is a linked list of backbone nodes. Each backbone node has a number of trees running up the left and right side.

When elements are pushed into this sequence from one side, they first start filling up (Show finger tree filling + emptying on board) the far side, and then they start to fill up the near half, and then, they have an overflow. When overflowing, a bunch of the nodes that have bunched up near the center need to be pushed down a level. At level 1, we have nodes of depth 1. At level 2, we have nodes of depth 2, etc, etc. This means that something will overflow N+1 levels much less frequently than N levels. This can be extrapolated into a formal proof to show that the depth is very shallow, and is logarithmic in terms of the number of elements.

The conventional, let's call it "horizontal," finger tree has a spine that is essentially a linked list. When things are pushed down, they are inserted into the lower level. In a purely functional world, this involves invalidating the bottom node's reference, so the top node also has to be re-recreated. This creates explicit data dependency between the creation of the layers. It's a long, serial process. There is a lot of control flow, a lot of having to move around to tell where elements will live. Worst of all, there is a lot of indirection. For a query to traverse 5 levels requires at worst 5 cache misses. This is unacceptable. I believe that some space can be traded for improved memory performance and overall efficiency.

The first complaint that I had was related to the allocation patterns. If you ever see a diagram of a finger tree spine node, you have one group of nodes on the left here, and one on the right, and you respond to a change by invalidating this entire bunching of 1-4 things and recreating it. Most of the time lists will likely be short, there should be a lightweight way to optimize common access. Well this proved hard, and easy wins are nice. So I looked at something that sucked. When adding an element, we go from having 3 elements on this level N to making that 3-node garbage with a new 4-node, and then we overflow and move these same old 3 elements down to become one node in a lower level. If you repeat this hundreds of times, you've made hundreds of 3-nodes into garbage when making the 4-node, and now have to re-create it. Let's hold onto the 3-node, and embed these groupings directly into the level.  (Show how holding onto 3-node works)

In my new state changes, the state of having 1 element requires holding one ref. 2 references, 2 refs. 3 references, we now need to put these three into a node on the heap. Four, we need a node and a ref. After the fourth, we overflow back to having 2. We are refusing to create memory traffic by holding onto this node of size 3 before pushing it down. This also makes all heap nodes the same size, which is easier on many allocators.

I refer to the first element as the slop or carry position because it is where a 3-node overflows to when we overflow. But I'll address that in a bit. The next element, the one filled when adding another element is the one affix. This is a modification of the existing terminology. The position that the 3-node that hasn't overflowed yet holds, this is the 3-affix.

Now let's look at the slow things: rebalancing and search. We don't want to have to follow those levels of indirection. Let's put them in an array. This has some benefits and some side effects. The benefits are in cache locality. We don't want to have to jump all over memory to access the center of the "sequence." So we accept that we are going to have a bit more space used because on a change, the persistent data structure "snapshot" isn't just the head level. But what a brave new world this is. We have the ability to work with random points inside the sequence very efficiently.

We store the left and right sides of the structure in parallel arrays, or "spines." (Spine diagram) Each spine holds a multidimensional array of width 3, for the slop, 1-affix, and 3-affix. If we have the pointers in columns, and possible nulls, we want to use a bitmask. That leads us to use a bitmask to describe the structure and just putting the data directly in the array. We don't store any nulls, instead having the logic that if an arbitrary bit is set, then we can check the number of bits below it very quickly with the popcount instruction.

This exposes structural transformations as simple bit tricks, followed by memcpy-ing the underlying memory segment. This minimizes the amount of movement necessary to create the new state.

That allowed me to notice a pattern. The entire segment that is part of the overflow will remain a candy-stripe of elements of the same size. The previous 3-node for level N will overflow into the slop on level N+1, N+1's slop moves into the N+1 1-affix, and the N+1 3-affix overflows into the next level.

(Show how the overflow+underflow candy-stripes)

This means that on overflow our vertical structure does... a lot of nothing! We alternate memcpy-ies essentially.  We put the new element at the start, copy the overflowing elements wholesale, do the non-overflowing level, and then the remaining levels left alone. And we can do this in parallel, more or less. Modern CPUs will schedule movement and bit manipulation through the functional units efficiently.

(Work through parallel overflow)

Now most of the search inefficiencies have already been fixed. Let's have one last dive into the finger tree. There is the requirement that the data we put into a finger tree will be on the domain of a function that returns a value. This function is named measure. We also require that a function be defined called combine which takes two measures and combines them. A measure can be an unsigned integer, a hash map, a checksum, etc. The only rule is that if a and b are two measures, then any predicate we want to run over them is true of combine(a, b) if it is true of either a or b. This allows us to monotonically increase the measure until a query goes from false to true. While it sounds restricting, it's very easy to use.

An example is the random access array. The measure is the number of elements in the list under the operation concatenation. The number of elements in a list of size 1 is 1. The number of elements in combine(a, b) is the sum of the sizes of the sub-elements. We can find the 100-th element by setting the base element to 0 and combining all of the measures of along the spines until we find one that makes the query (>= 100) true. We then back up and descend into the tree.

(Show use as random access array)

An optimization that we will make is to store the measure with the pointer rather than at the head of the node. We store the measure with the pointer, and store them both directly inline. This makes it possible to never jump out of the spine unnecessarily in a search. We also never touch nodes that aren't along our search. This is optimal for cache use. It may result in slightly-larger sub-structures in the persistent data structure, but I believe that this massive benefit to cache locality makes up for a bit more memory pressure.

We step into the node, see the reference that causes it to go from true to false, until it finds the leaf responsible for the change. This is the element that has the offset in the random access structure.

I'm not finished implementing it, but it's coming together very well. All of the pieces work together in isolation, but I'm bringing them together. I'm optimistic for how my bit trickery performs with varying levels of compiler optimizations set.

No comments:

Post a Comment