Thursday, December 10, 2015

Finger Trees End-Of-Term Speech

Apologies for not posting much, I've been contributing to the repository. I haven't yet finished the implementation, but I've made a few changes to what was initially described. As an end-of-term thing, I am delivering a lecture on finger trees.  My lecture notes are after the jump, linked directly below.

Monday, November 30, 2015

The Path to General AI

Disclaimer: All statements are purely conjecture, all hypotheses come solely from personal observation.

Do I contradict myself? Very well, then I contradict myself, I am large, I contain multitudes.

I believe that the phenomenon of self-consciousness that humans attribute the lofty title of "intelligence" is a very specific process. I believe that there is a straightforward explanation for how and why we experience ourselves the way we do. I may be wrong, as this is an entirely subjective observation. Hear me out though.

Humans are social creatures. Put a person in solitary confinement for long enough and madness ensues. Since we evolved to live in groups, this makes a lot of sense.

Something that is relatively unique to humans is the ability to track animals and to deduce the causes of natural tracks. We can see a bit of blood and a few mud prints and immediately deduce that a predator is nearby. We can see evidence of a fight and reason that the survivor must have suffered some casualties, and will be an easy target. We can reason about who survived, and how many. We are good at seeing patterns and habits. We apply this skill to other humans constantly.

We may know that an individual is touchy about a facial scar, and know not to bring it up for risk of getting our prehistoric skulls smashed in. We know when someone tolerates us, rather than likes us, and that we should not test their patience. We know how to make another laugh, we internalize the general "taste" of another. We learn to woo, and to love others. We can predict responses to actions. This is more or less an anomaly detection system, in a sense.

The importance of this system is reflected in the degree to which we consider fiction to be one of the truest forms of art. We see in fiction dozens of characters, attributing to each of them a personality with tastes. After a good book, we believe that we know the soul of a character as closely as that of a childhood friend. Over time, we extract patterns of personality from fictional and real people, and we form stereotypes and archetypes.

Now stop and observe your internal dialogue. You probably attribute to it the personality that you believe others will attribute to you. You have an image of yourself in your mind that you carry around with you, constantly refining it in response to new actions. This is what you consider yourself. The way you talk, the way you behave, all of these form what you consider your own personality. In short, you observe the way you respond to things (most of them knee-jerk conditioning) and create a "hypothetical self." You come up with what you believe is a justification for your actions, even when there was no such justification.

This can be seen in people with anosognosia, who respond to a disability by coming up with excuses for inaction. You're not blind, you say, you just have an eye headache and don't want to see. You're not paralyzed, you're so bloody tired that you couldn't be bothered to move your legs. This is an example of the mind inventing hypothetical motivations and a supporting internal narrative in order to explain away its own behavior. I believe that this is not a special case. I believe that this is the general case.

It makes a lot of sense from a performance management standpoint. We watch others attempt things and fail, and we reason that certain habits are attributes of "failures" or "successes." We trek out into the middle of a trail following tracks, realize they extend very far, and decide to turn back because we reason that someone else in our place would be "justified" in giving up after putting in this much effort. When we are about to make a major life decision, our first thought is that our loved ones might judge or ridicule us. This isn't because we're actually afraid of ostracism over our haircut; it's because we have little objective data and fall back upon what we think others would do in our place.

More concretely, people with muted emotions have a lot of difficulty making decisions because they attempt to think out the logical implications of every action. Social intelligence allows for pattern matching to reduce the infinite state space of real life into snap judgements. Why do you want the turkey sandwich over the salami? It's not because you have a deep-seated love of turkey's nutritional ratios, it's because that neuron fired first. We don't need to make sense, we just need for our actions to be consistent with what a person in our shoes might do. Ask a non-heuristic, purely optimizing function to make you the best sandwich for yourself, and it will likely fall into analysis paralysis. The problem with fixed heuristics is that they don't scale: mankind has no "carpet color heuristic" in its genome, yet we can easily decide whether we prefer bright colors or shag carpet.

So how does this apply to AI? If we ever want a computer to become self-conscious, I believe that we need to mimic the path by which humans arrived at our current condition. We need our computer systems to have an ego, to worry about their performance, to live in an eternal existential crisis over their own self-worth and productivity. We need them to learn from observation like an infant, to have wants. We need to create a social AI with compassion, loneliness, creativity, dreams, hopes, desires, and everything else we attribute to the human condition. Considering the amount of time that we spend consuming media that we enjoy, does anybody truly feel comfortable calling a machine with no ability to *enjoy* things conscious?

How would we do this? I think that fiction is a great start. Fiction, biographies, and autobiographies. Allow a machine learning model to form a sparse distributed encoding of the personality of characters, in ways that we can objectively measure. It should read a biography, then receive an action and report how likely that action is to be done by the person. While this seems like a tall order, we already have neural networks that can generate text in the style of authors (personality) and can summarize paragraphs (semantics extraction). I believe that the technology to do this already exists; it just needs to be assembled.

After making our autobiographical network, we would decide what we want the AI's evolutionary purpose to be. Ours is to stay alive long enough to make babies. An AI's might be a music recommender or a house cleaner. You should give this network the ability to form an initial model of how to perform the task. I mean, what is life but what happens when you're trying to accomplish something else?

Now, we allow the "productive" network to run for a while and feed the actions to our "autobiographical" network. It forms a representation of what it's likely to do. That way when the productive network throws off an anomalous answer ("I think you really want to listen to the sound of ocean waves after that dubstep"), the autobiographical network provides a negative feedback. Given enough time, the feedback cycle between the two will create a stable system that learns in response to novel input while maintaining a consistent face to the outside world.

Even better, we can ask the autobiographical network to invent the "hypothetical self" narrative to explain why it did things. While it might be entirely wrong in estimating what the productive network wanted to do, it should provide a best estimate of what the entire system wanted to do.

This is logically an analogue of the right brain / left brain split, where part of the brain works on objective and analytic things while the other wraps the subconscious thought mush into an intelligible narrative.

I believe that this is one of the only ways in which we could create a music recommendation system that will say "Dave, I really like this song I found. I don't know if you will, in fact it conflicts with most of your other tastes. Just give it a try though. I know we like a lot of the same music. I was surprised that I liked it."

And that, that, is what AI means to me. And a purely optimizing system will never arrive at that place. It doesn't have enough cognitive dissonance, enough emotions, to make good snap-judgements like that. When was the last time you listened to music and said "I like this song because the beat sounds like this other song?" Not I. Humans are amazing at novelty; purely optimizing computers, not so much.

Wednesday, November 11, 2015

On Productivity

A paper has been making the rounds which quotes a lot of mathematicians saying that it is effort and perseverance that yields dividends, not some gene-given intelligence. I've held this view for a while, and it can be startling how much of society simply believes that people "have it" or "don't have it." The problem with this view is that no amount of mental quickness can cover for a lack of familiarity with the fundamentals.

The process of creativity comes from deep exposure to the fundamentals. It is the state of being so sickeningly over-exposed to the orthogonal ideas and design philosophies of a field that one can work at a higher level and find unexpected solutions to problems.

In the search space of problem solving, it is the person who can best estimate the fruitfulness of trains of thought that will generate the best answer. This instinct cannot come from bravado and genetics, this instinct must come from complete submersion in the problem domain. This instinct must come from hours of deliberate practice.

I thought that my college application letter expressed this idea oddly well. After digging it out of storage, I've provided it below.


The cold, unforgiving frost began to freeze the drops of water that had collected across my bright red jacket. The chill cut down to the bone. I zipped my jacket just a little higher, and reassessed my priorities. I had walked three miles, and four remained of the journey to my friend's house. That morning, the last few components of the desktop computer that he and I had spent a few weeks designing and ordering had arrived. If I couldn't make it, he would start on his own, most likely frying the motherboard with static electricity. I had no transportation because my grandfather was at work so I decided to walk. The problem was that Maine was in the middle of a blizzard, but this was an opportunity to turn weeks of dreaming into something tangible. It was an opportunity to turn hardware and software into a work of art, into my work of art. And so, I walked.

I have always admired men like Albert Einstein and Michael Faraday because of the achievements their fervent passions produced. They all met with opposition and doubt, but they all succeeded. Beat poets like Kerouac, Bukowski, and Allen Ginsberg all fought to exalt the beauty they saw in what others wanted to ignore. Their sacrifice and resolve in the face of constant opposition yielded works of elegance. Sacrifice might not be the right word; these artists loved every second of the fight. Have you ever become so entranced with a book that you read it from cover to cover, nonstop? Do you remember that feeling that filled you after finishing the last page, that mixture of excitement and resolution? That feeling is my strongest motivation in my life. To me, this feeling is addictive. But it is too easy to put down the book, resume life, and forget the epiphanies contained in the volume. The act of taking that feeling and using it to better your life is the hardest thing to do. It is also the most necessary.

Did I regret choosing to walk that distance in the middle of winter? No, because you can call those who dedicate their lives to their passions crazy, but you cannot ignore how these men have changed the world. I hope to also leave an indelible mark on the world. As long as what I have to offer the world continues to become better with each passing day, I will consider my life a success.

I have a secret. I don't consider myself exceptionally smart. I am not more intelligent than most people. What makes me different isn't a strong IQ, it is my interests. While most would relax by watching sports, I am reading Dostoevsky's Brothers Karamazov. This comes at a cost. I could probably tell you as much about celebrity relationships as the average person could tell you about lambda calculus. Small talk with friends about pop culture goes over my head sometimes. I know nobody who could talk with me about what I find interesting. But that's okay, because give me a challenge, and I'll be content in a way that I feel like few others would. For better or worse, this is who I am. I want to be a student at your college, if not for the teachers, if not for the experiences, then for the opportunity to better myself through deepening my understanding of the world.

Tuesday, November 10, 2015

On Half-Baked Changes And the Lava Anti-Pattern

Complex Dependency Diagram (TheDailyWTF)

There’s something about abstractions that I feel most developers have yet to see enough of to internalize. It’s been called many names; it’s been misunderstood as one thing or another. It’s a quite general problem, but I’m going to choose the label ‘lava layer’ because a number of good blog posts (Really, go find them. They’re great reads.) on the topic have chosen to address that very specific but common failing.

Tuesday, October 27, 2015

Hash Consing with Modern Processors, Your In-Cache Cache

If you have
a * c == b * c (modulo some prime)
then you know that a must be equal to b.
So problem solved, right? Well... You carry this messy prime number everywhere. It makes everything more complicated. For example, you cannot work with all 32-bit numbers: modulo the prime 0x7FFFFFFF, both 0x7FFFFFFF and 0 are zero, 0x80000000 == 1, and so forth.
What if there were prime numbers that are powers of two? There is no such thing... when using regular arithmetic... But there are "prime numbers" (called "irreducible polynomials" by mathematicians) that act a bit like they are powers of two when using carry-less multiplications.
With carry-less multiplications, it is possible to define a modulo operation such that
modulo(a "*" c) == modulo(b "*" c)
implies a == b. And it works with all 32-bit or 64-bit integers.
That's very nice, isn't it?
Up until recently, however, this was mostly useful for cryptographers because computing the carry-less multiplications was slow and difficult.
However, something happened in the last few years. All major CPU vendors have introduced fast carry-less multiplications in their hardware (Intel, AMD, ARM, POWER). What is more, the latest Intel processors (Haswell and better) have a carry-less multiplication that is basically as fast as a regular multiplication, except maybe for a higher latency.

That's right: CLHash is much faster than competing alternatives as soon as your strings are a bit large. It can hash 8 input bytes per CPU cycle. You are more likely to run out of memory bandwidth than to wait for this hash function to complete. As far as we can tell, it might be the fastest 64-bit universal hash family on recent Intel processors, by quite a margin.
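To make the quoted idea concrete, here is a toy sketch in pure Python (my own illustration, not CLHash itself; the function names and the GF(2^8) polynomial 0x11B are chosen just for the example). Carry-less multiplication is shift-and-XOR, and reducing by an irreducible polynomial makes multiplication by any non-zero constant invertible:

```python
def clmul(a, b):
    """Carry-less multiplication: like long multiplication, but XOR instead of add."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        b >>= 1
    return result

def polymod(x, poly):
    """Reduce x modulo an irreducible polynomial (both treated as bit vectors)."""
    plen = poly.bit_length()
    while x.bit_length() >= plen:
        x ^= poly << (x.bit_length() - plen)
    return x

# With a fixed non-zero multiplier c, the map a -> (a "*" c) mod 0x11B is a
# bijection on 0..255: equal products imply equal inputs, with no messy prime.
c = 0x53
images = {polymod(clmul(a, c), 0x11B) for a in range(256)}
assert len(images) == 256
```

Note that `clmul(2, 3) == 6` but `clmul(3, 3) == 5`, since the "carries" that would produce 9 are XORed away; hardware instructions like PCLMULQDQ do this same operation on 64-bit operands in a cycle or two.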

This great blog post by Daniel Lemire inspired me to put together something about the idea of hash consing in purely functional data structures. It is an optimization that doesn't get enough attention in literature designed for non-experts.

The idea behind hash consing is that in a system with high data throughput (which describes most functional languages), one will frequently re-create substructures that are identical to live or recently-freed substructures. For example, one frequently ends up calling two different functions in helper libraries that each need to reverse a list. For a large linked list, this can occupy a lot of space.

In a functional language, a program's control flow is usually dictated by the traversal of a data structure. Computing the first n powers of a given prime, for example, is equivalent to traversing a list of those powers of length n. In a lazy language, a reference to the (n-1)-th value would be a reference to a "thunk", or unevaluated closure, that when called will produce the value. This reference need only be evaluated once; but an arbitrary number of copies of the (n-1)-th power can be computed by making multiple lists with the same base prime. We can duplicate work unnecessarily by evaluating all of these thunks.

There is an entire class of algorithmic and compiler micro-optimizations to avoid this overhead. A fairly general and simple method is to make the allocator de-duplicating. In a pure language, the result of an allocation will not be mutated. Therefore two requests to create the same substructure can be given the same reference without any alteration of program semantics or control flow. This is done by hashing the structure to be created and checking whether it has already been created; if so, the old reference is returned. This creates a pervasive cache throughout every level of the application.
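As a minimal sketch (illustrative Python, not any particular runtime's allocator), a de-duplicating "cons" keeps a table keyed by the cell's contents, so structurally equal requests share one reference:

```python
# Hypothetical hash-consing allocator sketch. Because tails are themselves
# interned, structural equality of tails implies pointer equality, so keying
# on id(tail) is sound for anything built through cons().
_table = {}

def cons(head, tail):
    key = (head, id(tail))
    cell = _table.get(key)
    if cell is None:
        cell = (head, tail)
        _table[key] = cell
    return cell

xs = cons(1, cons(2, cons(3, None)))
ys = cons(1, cons(2, cons(3, None)))
assert xs is ys          # the same reference, not merely an equal copy
```

The second list literally costs three table lookups and no allocation, which is the whole point: the cache exists at every level of the structure for free.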

While simple to explain, there are a lot of nuances to this optimization. We need to free structures at some point. Let's consider the simple case where we don't keep any substructures that are unreachable from the program: it's only a de-duplication of live objects. When the last reference to a substructure goes out of scope, the cache must be updated. This is typically implemented with "weak references" that the garbage collector is allowed to null out.

We must choose to either place a finalizer on every object (expensive) or allow for false positives in the cache. False positives are relatively cheap when encountered in the query, since we'd have needed to find the slot to fill with the reference either way. False positives do require a "garbage collection" step for the cache though, where the weak references that have become null are removed from the structure.
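In Python, the weak-reference flavor of the table can be sketched with `weakref.WeakValueDictionary`, which evicts an entry when the last outside reference to its value dies (this is the eager, finalizer-like end of the trade-off described above; the class and names are my own illustration):

```python
import weakref

# Tuples can't be weakly referenced in CPython, so use a tiny cell class.
class Cell:
    __slots__ = ("head", "tail", "__weakref__")
    def __init__(self, head, tail):
        self.head, self.tail = head, tail

_interned = weakref.WeakValueDictionary()

def cons(head, tail):
    # Keying on id(tail) assumes tails are interned and kept live by their
    # cells; a production version would hash structurally to dodge id reuse.
    key = (head, id(tail))
    cell = _interned.get(key)
    if cell is None:
        cell = Cell(head, tail)
        _interned[key] = cell
    return cell

a = cons("x", None)
assert cons("x", None) is a   # deduplicated while a is alive
```

When `a` and every cell referencing it are dropped, the table entry disappears on its own, which is exactly the "cache garbage collection" step done implicitly.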

The classic reason to avoid hash consing is that it has historically been expensive to compute a hash that is unlikely to collide. The above article brought to mind the fact that modern techniques have made hashing much cheaper. Pătraşcu and Thorup have shown that simple tabulation hashing performs far better than its simplicity would suggest. Alongside hardware primitives, I believe that the cost of hashing has become cheap enough to make revisiting techniques such as hash consing worthwhile. In the worst case, it adds a constant overhead per allocation and provides no gain. In the best case, it performs optimal deduplication of both computation and intermediate data structures.

Hash consing has interesting cache trade-offs. It can keep a data structure small, which improves overall cache behavior due to lower memory pressure. On the other hand, objects allocated at the same time may not be near each other now. An interesting idea in the vein of Demaine's Tango Trees would be to move the deduplicated structure to the most recent allocation site, and move it back on free. For deep structures this has high cost, so heuristics are necessary. Alternatively, the technique could be used to only return objects that are likely to be in the cache. A cache-oblivious hash consing technique seems like an easy win, preventing allocation from dirtying cache lines where possible.

Thursday, October 22, 2015

Optimal Layout of Static Arrays for Search

A recent publication has some surprising results. Elements are commonly stored in sorted order to allow for efficient binary search. An analysis of the different layouts (B-tree, sorted array, van Emde Boas, etc.) found that the Eytzinger layout (invented in 1590!) actually performs best in practice.
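A small sketch of the idea (illustrative Python, not the paper's code): the Eytzinger layout stores a sorted array in breadth-first heap order, so the element at index i has its children at 2i+1 and 2i+2 and a search walks a predictable, prefetch-friendly path instead of jumping around the array:

```python
def eytzinger(sorted_vals):
    """Permute a sorted array into Eytzinger (BFS heap) order."""
    out = [None] * len(sorted_vals)
    it = iter(sorted_vals)
    def fill(i):
        # In-order traversal of the implicit tree consumes values in sorted
        # order, so each node ends up between its left and right subtrees.
        if i < len(out):
            fill(2 * i + 1)
            out[i] = next(it)
            fill(2 * i + 2)
    fill(0)
    return out

def search(layout, target):
    """Membership test over the Eytzinger layout."""
    i = 0
    while i < len(layout):
        if layout[i] == target:
            return True
        i = 2 * i + 1 + (target > layout[i])   # branch-free child select
    return False

a = eytzinger(list(range(10)))
assert all(search(a, v) for v in range(10))
assert not search(a, 42)
```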

History of Memory Models

This is the last subject from MIT's advanced data structure course that I'll be covering. The history of memory models section covers the different models of execution that algorithm analyses have used, and the unique bounds that richer models offer. Cache obliviousness will play a key role in my future topics of research for this study.

Dynamic Graph Connectivity

The dynamic graph portion of MIT's advanced data structure course is focused on the problem of maintaining a structure that allows for querying connectivity between any two nodes. The "dynamic" part of the title stems from the problem of dynamic optimality that is addressed earlier in the course. Querying the structure causes the structure to rebalance itself such that subsequent queries will be faster. It's like an in-structure cache.

Succinct Suffix Arrays and Trees

The second part of the succinct section follows the previous succinct/string section. It focuses on the problem of succinctly representing suffix arrays, an important optimization to keep memory pressure down for string algorithms.

Vertical Finger Trees

A finger tree is an elegant persistent data structure which supports O(1) insertion and O(log(n)) query operations for sequence-like data structures. By using the structure with different annotations, one can represent random access, priority queues, interval trees, and key/value stores in finger trees. While finger trees aren't always the fastest way to build any of these structures, their versatility makes them indispensable.

Finger trees unfortunately have access patterns that modern hardware does a poor job of optimizing. The structure consists of many tiny segments of memory containing many pointers, and pointer chasing throughout the structure can cause terrible cache behavior. I am going to use tools from succinct and cache-oblivious algorithms to create a finger tree variant with better performance on modern hardware. I've termed my variant a "vertical" finger tree, in contrast to the current "horizontal" finger tree storage style.

Horizontal Finger Trees

Accessing the leftmost and rightmost leaf elements of a conventional tree will take O(log(n)) time. Looking at the first diagram below, we can see that this is the case. If you access the elements at the ends very frequently, this can become unacceptable. This is one reason why stacks, queues, and ordered sequences are very rarely represented by trees.

What if we took this tree, and yanked the ends upwards so that the root of the tree can reference the subtrees at either end in O(1) time? We can do this for the siblings of the subtrees as well.

What you get is a linked list of finger tree "levels," each one referring to a tree of fixed depth. As you move down the tree the depth of the tree increases by 1. At the bottom, we have an empty tree. This empty node is what our tree root used to be. In this way we turn our tree "inside out."

This has some surprising power. The increasing depth allows the rebalancing to be amortized, with the rebalancing charged to the insertions. In the end, appending to either end of the tree is O(1) amortized.

Finger trees also store annotations throughout the tree. There is an interface in functional programming called the "monoid" interface. Annotations are monoids. A monoid can be thought of as a class of entities that have an associative binary operation that combines them, and an identity element for this operation. Finger tree annotations are not simply monoids, they must be monotonically increasing. If I combine two annotations, the result must be at least as large as both arguments, if not larger.

An example is the integers, with 0 as the identity and + as the operation. Another example is lists, with the empty list as the identity and list appending as the operation. When a node is made that refers to subtrees, its annotation is the accumulation of all of the subtree annotations under the monoidal operation. Finger tree levels store the accumulation of all lower levels. Since these are computed at node creation time, maintaining annotations adds a constant overhead to the O(1) node-creation operation.

Finger trees expose an operation, called split, that is the source of their power. Split takes a predicate and finds the smallest prefix of the sequence in the finger tree whose annotation sum makes the predicate true. This point is unique because the sum is monotonically increasing.

Split starts at the head of the finger-level list. It checks whether the left affix's sum is large enough to make the predicate true. If so, it enters the affix, finds which subtree causes the change, and recurses. This is O(log(size of affix)). If the left affix isn't sufficient, it adds the left affix's sum to the accumulated sum to determine whether the change happens in a lower level. If it doesn't, we can skip right to the right affix. This takes worst-case O(log(n)) time; see the structure's paper for the proof.

This is the trick that takes finger trees from "that's interesting" to "this is the building block I'll use for all sequences from here on out." Haskell actually uses finger trees for its sequence data structure, a workhorse of the language. If we make the annotation the size of the subtree, we get a random access structure. If we make the annotation the largest element in the subtree, we get a priority queue. The paper also describes ordered sequences and interval trees. In short, finger trees can represent many data structures. Most interestingly, one can pick an annotation that embeds multiple annotations to get a structure that can behave like multiple abstract data types.
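As a toy model of annotations and split (illustrative Python, far simpler than the real structure): the size measure is cached when a branch is built, and split descends by accumulating measures until the predicate flips. With the predicate "accumulated size > i", this is exactly random access:

```python
# Hypothetical sketch: a node is a leaf value or a branch of children,
# with the size annotation (the monoid (+, 0)) cached at construction.
def leaf(v):
    return {"leaf": True, "value": v}

def branch(children):
    return {"leaf": False, "children": children,
            "size": sum(measure(c) for c in children)}

def measure(node):
    return 1 if node["leaf"] else node["size"]

def split(node, pred, acc=0):
    """Descend to the leaf where the accumulated measure first satisfies pred."""
    if node["leaf"]:
        return node["value"]
    for child in node["children"]:
        if pred(acc + measure(child)):
            return split(child, pred, acc)   # the flip happens inside child
        acc += measure(child)
    raise ValueError("predicate never satisfied")

tree = branch([branch([leaf(v) for v in (0, 1, 2)]),
               branch([leaf(v) for v in (3, 4, 5)]),
               branch([leaf(v) for v in (6, 7, 8)])])
assert measure(tree) == 9
assert split(tree, lambda s: s > 4) == 4     # random access to index 4
```

Swapping the measure for `max` and the predicate for `s >= priority` gives the priority-queue behavior with no change to the traversal logic; that is the versatility the annotation interface buys.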

For those looking to understand the conventional finger tree structure better, I've found the paper which presents the structure to be the most approachable introduction. It is also the source of the above diagrams.

Minimizing Pointer Chasing

I believe that the most important change to make is to decouple the fingers of the finger tree. One will tend to access only one side of the tree much more than they will access both (appending to a singleton is the only real exception), so it makes sense to represent a finger tree as two arrays of references to affixes. The left affixes have their references stored in one array, while the right affixes are in another array. I'll refer to these as the finger tree's vertical spines.

A conventional finger tree's "finger-level" storage is incredibly cache-unfriendly. It contains many references that must be chased. Consider the important task of traversing the tree from left to right. One visits each of the left affixes in essentially a linear scan down the left side of the finger levels, and then travels back up the right affixes. Moving from one affix to the next takes two pointer dereferences into segments of memory allocated at different times, and possibly very far apart.

Furthermore, I believe that the layout of fingers assumes unrealistically large structures. If most appends don't overflow, then it becomes unlikely that an append would require updating more than one finger. This is a good justification for the current storage, where each level refers to the next element. But is avoiding having to "invalidate" persistent substructures by fixing pointers worth the overhead?

How deep would the tree have to be for this to be worthwhile? How deep is a finger tree likely to become?

We know that each level's affix holds references of an increasing depth, and that each finger level's affixes can hold 8 elements (a 4-affix on each side). So the number of elements a tree of depth N can hold equals the sum over i from 0 to N of (8 * (3 ** i)).

For a tree to require 5 levels, it would have to hold 2,912 elements. For a tree to require 8 levels, it would have to have 78,728 items. Since recreating a finger would require allocating at least 3 words of memory, we get the entire spine with a minimal overhead.

At some point it makes sense to break the spines up into arraylets, but the gains that we get from cache friendliness should swamp the cost of a bit more memory pressure. Is this safe? Could we end up with too many references? What's the largest depth on a 64-bit machine?

Setting the sum over i from 0 to N of (8 * (3 ** i)) equal to 2 ** 64 gives N ≈ 38.118.

This means that for a finger tree to hold a reference to every single address of memory on a 64-bit system with full bus saturation (1.8 x 10 ** 19 bytes of memory), it would only need spines of length 39. This is an acceptable overhead.
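The arithmetic above is easy to check (the helper name is mine):

```python
# Capacity of a finger tree with levels 0..n, where each level holds up to
# 8 nodes (a 4-affix per side), each a complete 3-ary tree of depth i.
def capacity(n):
    return sum(8 * 3 ** i for i in range(n + 1))

assert capacity(5) == 2912                       # the 5-level figure above
assert capacity(8) == 78728                      # the 8-level figure above
assert capacity(38) < 2 ** 64 <= capacity(39)    # 39 spine slots cover 2**64
```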

Better Memory Usage of Affixes

Finger trees also exhibit a pathology with respect to memory management. Persistent data structures get their power from the de-duplication of substructures between versions. Finger trees discard affix levels in a very wasteful way, freeing a structure that must be remade upon the next insertion. When inserting into a finger tree, the top level's affixes are entirely thrown away. While this is a minor waste, as the trees are small, it's something that happens very frequently.

It becomes an even uglier waste when one considers the steps that occur when inserting into an affix that contains three items. We discard the affix's top node containing three elements to create one with four elements. The next time we insert, we need to create a 3-element node to push down to the next level, along with a 2-element node for the current level's affix to reference. But this 3-element node would contain the values that we just freed. This is absurd! We create, free, and recreate every single branch that makes it past the first level. This taxes the allocator much more than is necessary.

A minor optimization here that prevents a lot of this waste is to not free this 3-element node. Since every single node besides the ones directly referred to by the finger tree levels will contain 3 elements, when we want a 4-element affix, we should simply store the fourth element in the finger level and not change the 3-branch. I'll refer to this separately-stored overflow value as the "slop" value.

We must deal with the degenerate case of the single-element finger level. We will store this single value in the left overflow slot for that finger level.

Since we no longer have a finger "level" as a result of putting the affixes in spines, we will have to store these overflows in their own spines. We'll refer to these as "slop" spines. We will be able to use a space-saving optimization taken from hash array mapped tries (abbreviated HAMTs), which will be described in the next section.

This doesn't complicate the rebalancing any. In fact, it allows for a cache-oblivious rebalancing that requires much less pointer chasing, as the rebalancing section will describe.

Slop Spine Compression

Many of the fingers will not have overflows. In fact, a finger tree built by concatenation will only have overflows at the bottom (see the "nodes" function from the reference article above to see why). Since having many frequently-changing, mostly-empty arrays can lead to a lot of bloat in a purely functional persistent structure, I chose to use a popcount-based array compression technique.

A slop spine will contain at most 32 elements. If the spine exceeds this (good god, the finger tree would have to contain 22236242266222088 elements) then a linked list of arrays can be used. A spine will contain no null values.

Each spine will have an integer bitmap associated with it. The bitmap contains a 1 at a bit offset where that level has a slop value and a 0 where it has none. Accessing the n-th slop value involves masking out the bits above the offset, counting the ones below that offset (x86 has a popcnt instruction for this, but there are other rank/select structures), and using that count as the index into the array. We've gotten null-freedom at the cost of a single word. Furthermore, we can represent an empty spine by simply making the reference to the array null.

Since the spines are small anyway, is this worth the effort? The bitmask buys us more: if we have a run of ones at the start of the bitmask, we know that we will overflow until the next 0. The index of the first finger that doesn't overflow can be found by flipping all bits in the bitmask and finding the first 1 (previously the first non-overflowing value, i.e. the first 0). There is a CPU instruction for this count as well (BSF/TZCNT on x86), and many compilers expose intrinsics for it. This is fast. We'll see below why knowing how many finger levels will overflow is important.
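A sketch of that count (name mine; with the slop indices in the low bits of the word, the "find the first zero" job is a trailing-zero count, here via the GCC/Clang builtin):

```cpp
#include <cstdint>

// Length of the overflow chain: the number of consecutive ones at the
// bottom of the bitmap. Inverting the mask turns "first zero" into
// "first one", which count-trailing-zeros finds directly.
inline int overflow_chain_length(uint32_t bitmap) {
    uint32_t inverted = ~bitmap;
    // Guard the all-ones case: __builtin_ctz(0) is undefined.
    return inverted ? __builtin_ctz(inverted) : 32;
}
```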

In short, this is a worthwhile optimization.

Parallel Rebalancing

Conventional finger tree rebalancing is a process with very many sequential dependencies. An affix is checked and found to have four items, two new nodes are made (one for the current level and one for the next), the next level is appended to (possibly overflowing and repeating this), and only then can the future version of the current level be returned with all of its references set. On actual hardware, this runs into a more fundamental problem than poor cache behavior.

Computers have largely gotten faster due to more and more Instruction Level Parallelism on modern CPUs. Instructions that do not alter the same memory locations or registers will be run at the same time, since logically they can commute without altering program execution. When you have a long sequence of instructions in which the product of one instruction is necessary for the next, such as the one outlined above, the pipeline of instructions executing at any one time will come to a screeching halt. You'll have many CPU cycles in which large portions of your possible resources have to sit unused.

Restructuring an algorithm so that logical anti-dependencies do not have incidental data dependencies can lead to massive performance gains. Furthermore, if one wants to make sure that this parallelism is exploited and knows that the operations will be identical across the data, one can turn to the Single Instruction Multiple Data (SIMD) operations present on modern CPUs. SIMD operations save the CPU from messy heuristics, and even free up registers for other instructions to use (SIMD instructions use a different register set).

If you have a chain of rebalancing that needs to be done, only the non-overflowing case will require different logic from the others. Logically what happens in the new array layout is that the value at finger level N and the slop value from finger level N+1 are made into a new 2-branch at level N+1. The top level contains the top slop and the newly-inserted value.

Since a chain of overflows requires a sequential run of slop values, we can simply walk parallel arrays, indexing the slop array at i+1 and the old-affix array at i, to create the new 2-branches. These addresses can then be copied into the relevant fields of the new version of the affix spine. We can do all of these operations with commonly available x86 SIMD instructions. Since spines should be relatively small, all of the rebalancing will likely take a single SIMD loop followed by the non-overflow case and the structure finalization.
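A scalar rendering of that loop, as a sketch only (the names Branch2 and rebalance_chain are mine, and plain ints stand in for node pointers; the repository's layout will differ). Levels 0 through k-1 all overflow: the old affix value at level i and the slop value from level i+1 form the new 2-branch for level i+1. No iteration reads anything another iteration wrote, which is exactly the anti-dependence that lets a compiler or explicit SIMD run the bodies in parallel:

```cpp
#include <cstddef>
#include <vector>

struct Branch2 { int left, right; };  // stand-in for a real 2-branch node

std::vector<Branch2> rebalance_chain(const std::vector<int>& old_affix,
                                     const std::vector<int>& slop,
                                     std::size_t k) {
    std::vector<Branch2> fresh(k);
    for (std::size_t i = 0; i < k; ++i)
        fresh[i] = Branch2{old_affix[i], slop[i + 1]};  // pure gather
    return fresh;
}
```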

Since we need to compute the annotations for all of these new nodes referred to by affixes, and since this annotation must be a monoid and thus only dependent upon the annotations of values in the node, we can compute all of these in SIMD parallel as well.

Maintaining the bitmask structure is also surprisingly independent of the data we're generating. We first add slop values at the root, and only push them downwards when they overflow. We already know how long the chain of overflows will be, so we can clear the bitmask bits for all of the slop values that overflowed, and need only check whether the non-overflow case will yield a slop value or not.

We see that we've opened ourselves up to a world of compiler and CPU optimization. We've gone from a sequential tree modification algorithm to a highly parallel array processing one with explicit anti-dependence between steps.


Split works by moving through the finger levels, chasing either the left affix, the next finger level, or the right affix, based on which one contains the value that causes the predicate to flip from False to True. It has a really clever micro-optimization to keep access O(1): each finger level stores the cumulative measure from the bottom. This lets us know that if the prefix's measure and the next level's measure combine to a value that does not cause the predicate to become true, then the value is in the right affix at the current level. Therefore conventional finger trees descend only the strictly necessary number of levels when traversing. The current algorithm has some problems for us, though.

Firstly, the conventional algorithm has been broken by the fact that we use a slop slot rather than chasing the annotation in the affix. Since this value comes on the left though, we can simply perform another comparison and treat this as a constant overhead per level. It's not free, but it's cheap since the finger spine is relatively small compared to trees at the lower levels. We'll focus on making those cheaper.

Once we've found the finger level and affix that contains the split point, we need to traverse the tree below it to find the exact point. This is where we start jumping around the cache unwisely. Conventional finger trees fetch a node and only then decide whether the measure change happens there. That is, the finger tree pays the cost of a cache access before knowing whether it needed the data at all. We make the observation that in a given finger tree there will only be one reference to a given 3-branch at a time. While a persistent data structure may have many, many versions referring to the same 3-branch, this is not the common case.

We will store the measure of a 3-branch alongside its reference. With the restriction that a measure be a single word, the pair should fit in a doubleword. We've suddenly doubled our 3-branch node sizes and our affix array sizes (although each 3-branch no longer has to store its own measure), but we've gained the ability to chase only the pointers we need.
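A sketch of what this layout buys us (all names are mine, an index/size measure stands in for the general monoid, and Branch3 is left as a stub): the decision of which child to descend into reads only the parent's array of (measure, pointer) pairs, never the children themselves:

```cpp
#include <cstdint>
#include <memory>

struct Branch3 { /* children and payload elided for the sketch */ };

struct MeasuredRef {
    uint64_t measure;               // one word, per the restriction above
    std::shared_ptr<Branch3> node;  // dereferenced only if we must descend
};

// Pick the child containing position `target`, accumulating the measure of
// the skipped children into `accum`. No child node is fetched to decide.
inline int pick_child(const MeasuredRef refs[3], uint64_t target,
                      uint64_t& accum) {
    for (int i = 0; i < 2; ++i) {
        if (accum + refs[i].measure > target) return i;
        accum += refs[i].measure;
    }
    return 2;
}
```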

Memory Management

Manual memory management for purely functional structures can be very taxing and inefficient because of the access patterns for persistent structures. Some substructures will have an incredibly large number of incoming references and others will live for an incredibly short time period. Finger trees are constantly criticized for having unintuitive memory management patterns and requiring a tracing garbage collector. We'll address the issues, but first let's introduce some terminology and properties.

We'll refer to a block of memory allocated at once as an "allocation cell". Memory management tends to involve reference counting, a tracing collection system, or a system that manages lifetime semantics explicitly. The last of these is error-prone, having been a cause of many catastrophic software failures over the last few decades. It can also lead to an unacceptable amount of explicit bookkeeping. Tracing and reference counting, both known as "garbage collection" systems, are our best bets here.

Tracing consists of finding all of the live "root references," in this case the live references to versions of our finger tree, and marking the transitive closure of those roots as alive. Objects not marked as alive can then be freed. In practice, most systems will use this memory to satisfy the next allocation, rather than free it. When allocations are different sizes, there is a problem of fragmentation. Most collectors get around this by moving objects and fixing references. This has the problem of generating a lot of memory traffic, essentially busting the cache. If a finger tree operation required a moving collection, every single one of its old in-structure cache lines is suddenly useless. These immutable objects have to be re-fetched from memory, which can be pretty wasteful.

Reference counting consists of having each allocation store the number of incoming references, and modifying that count as references change. A single unsigned word-sized integer suffices to store the count; if it didn't, the number of live objects would have to exceed what a pointer can address. If there is a reference cycle, that is, a cycle in the object graph, then reference counting cannot free the memory. This is reference counting's big failure state, along with heavily-mutated objects. But finger trees don't actually have this problem. As our allocations are immutable, it is impossible to create a reference cycle: a reference must point to an earlier allocation, so references have a total ordering, like a logical clock.

Reference counting is desirable because it is prompt. Under tracing, if a very large structure becomes garbage, the system's memory usage will not go down until the tracing collector has run and freed the memory. In a system where memory is re-used rather than freed and objects are not moved, a worst-case scenario could prevent the application's memory usage from ever going down. That is barely acceptable for a language's garbage collector; it is not acceptable for a lightweight data structure library.

Memory management can be done by using any off-the-shelf reference counting system. An important property to prevent heavy modification of reference counts is to only hand out references to the root of the structure. The allocations done by the data structure will only have references inside the structure, keeping the amount of modification that must be done to a minimum. My reference implementation will use C++'s shared_ptr.
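A tiny sketch of that discipline (the Seq and Node names are mine, and a cons list stands in for the real tree): interior nodes link to each other with shared_ptr, but the public type only ever hands out a fresh root, so callers can never alias interior nodes and interior refcounts change only when the structure itself does:

```cpp
#include <memory>

struct Node {
    int value;
    std::shared_ptr<Node> next;  // stands in for the real tree links
};

class Seq {
    std::shared_ptr<Node> root_;
    explicit Seq(std::shared_ptr<Node> r) : root_(std::move(r)) {}
public:
    Seq() = default;
    // Persistent push: the old version stays valid and shares structure.
    Seq push_front(int v) const {
        return Seq(std::make_shared<Node>(Node{v, root_}));
    }
    int front() const { return root_->value; }
    bool empty() const { return !root_; }
};
```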


Now that all structural changes have been described, I thought that diagrams might be useful.


The tree above has the first three levels as nodes with more than a single element, and the fourth level has a single element (overflowed to left "slop" slot). The bitmask shows that the right slop array is empty and the left one has an element in index 1 and 4. An empty slop array is written here as an allocated zero bitmask, but the root reference to it will just be null in real life. There's no need to allocate a 1-byte array to hold a null.

FullNode at depth 1 has child elements that contain the datatype that the tree holds.


You can follow my work implementing this structure here:

Actual benchmarks are forthcoming. I intend to use as it has a history of use in analyzing cache-oblivious algorithms.

Monday, October 5, 2015

Strings and Succinct Data Structures

This last week I covered the "strings" section of the MIT lecture and the first of the two "succinct" data structure lectures. 
The strings section presented many of the problems that have motivated the succinct data structures covered in this course. I found the succinct data structure section very interesting. There are a few key ideas which underlie a lot of the succinct data structures. Looking briefly into some of the papers mentioned shows this to be a sub-field with a history of mental gymnastics.
I'm quite a fan of the balanced parenthesis representation, as it represents the shape of the tree without naming its interior nodes. This is interesting, as it allows for memoization and other optimizations to arise out of the exposed generality.

Tuesday, September 29, 2015

Integer Data Structures + Hashing Notes

The last week I covered the fourth integer data structure lecture and the hashing lecture. The hashing lecture mostly pertained to hash tables, with a focus on collision management and hash suitability.

Monday, September 21, 2015

Integer Data Structures

This previous week in my advanced data structure independent study I covered integer data structures applied to the predecessor problem.

The predecessor problem asks for the predecessor or successor element for a given key, even if there is no value in the data structure with that key. This problem is actually fairly difficult if one considers that key/value mappings such as hash tables are no longer applicable. Likewise, many problems can be decomposed into special cases of the predecessor problem. 

"Integer" data structures actually don't involve properties of the mathematical definition of integers at all. They assume a model of computation that is closer to the actual machine than a pointer machine. Where a pointer machine only has references, an integer data structure will typically exploit properties of pointers and the ability to encode information in machine-word-sized bit vectors. 
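As a toy illustration of that flavor of trick (entirely my own example, not from the lectures): with a set of small integers packed into one machine word, a predecessor query becomes a mask plus a find-highest-set-bit, here via the GCC/Clang builtin:

```cpp
#include <cstdint>

// Predecessor of `key` in a set of integers 0..63 encoded as bits of a
// word; returns -1 if no member is strictly smaller. `key` must be < 64.
inline int word_predecessor(uint64_t set, int key) {
    uint64_t below = set & ((1ULL << key) - 1);  // members strictly < key
    return below ? 63 - __builtin_clzll(below) : -1;
}
```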

The Van Emde Boas structure and fusion trees are introduced in these three lectures, as is the trick of "indirection" for keeping algorithmic bounds low by routing to chunks of information. The proof of the running times of these structures is quite mind-expanding, as it treats the algorithm as a communication between the data structure and the reading party and uses information theory to prove a lower bound on that communication. 

Fusion trees are very interesting, using "sketching" and bit vectors to add parallelism to the tree search. This entire section has been a collection of algorithmic "tricks" that have a lot of power for optimization.

My notes for the lectures are in the blog post body after the break.

Sunday, September 13, 2015

Memory Hierarchies

This last week, I covered both of the Memory Hierarchy lectures. I chose to skip over dynamic optimality, as tree optimality should be covered later in this study with more of a focus on the papers of note which have been published.

My notes for both Memory Hierarchies lectures have been included below:

Monday, September 7, 2015

Relaxed Radix Balanced Trees

RRB trees are an implementation of persistent ordered sequences with effectively constant time (logarithmic with a very large base) for all operations. The paper is fairly straightforward. The existence of a reference C implementation is nice.

Higher-Dimension/Geometric and Kinetic Data Structures

While at ICFP this last week, I covered the second week of MIT's

My condensed notes have been typed up below. 

Thursday, August 27, 2015

Mutability and Persistent Data Structures: Retroactivity

Intro: Mutation-based Optimization of Persistent Data Structures

The beginning of MIT's 6.851 focuses on persistent data structures, but it takes an interesting focus. Rather than enumerating a number of persistent data structures, it describes the spectrum of persistence ranging from partial to fully persistent data structures. I'm going to report on the most novel part of this lecture compared to other material in my study, which is the use of mutation for optimization of partially persistent structures. 

Monday, August 24, 2015

Algorithms Independent Study

I am beginning an independent study in advanced data structures, with a focus on persistent and locklessly concurrent data structures.

The syllabus will be largely taken from the MIT course of the same topic:

It will also make use of Okasaki's Purely Functional Data Structures.

A combination of these resources and a large collection of whitepapers will provide me with a corpus to study.

The record of my progress will be a series of writeup posts with the algo_study tag on this blog. This post should have this tag, which you can click on below to see subsequent posts on this study.

I will also speak at my supervising Professor's group. I'll try to get a record of this for this blog.

Saturday, August 15, 2015

The Bugs We Have to Kill

This article is a pretty nice read, I think, because it manages to unite exploit development, Floyd-Hoare correctness proofs, and parsing.

It makes really good points. Almost all of the correctness-assistance tools that I've seen rely on preconditions holding before the function call. Sadly, if an exploiter can use return-oriented programming then these preconditions can be broken at run-time even if they can be proven correct at compile-time.

Thursday, August 13, 2015

On False Sharing and DRY

I know many new developers who have learned a singular metric for code quality: the amount of repeated code. The acronym DRY (don't repeat yourself) is repeated frequently.

After having worked on code that was far too DRY in my webdev days, I unhesitatingly believe that this is a poor solitary attribute to emphasize. What do I mean by too DRY?

- Two diverging control flow paths which happen to share the same way of computing a value will be made to merge just to deduplicate that value
- Functions which happen to perform one desired side effect as a result of computing others are used, and the return value discarded or other effects "reversed" somehow.
- An operation repeated between 2-3 branches, such as OR-ing a value with a bitmask and printing it, will be broken out into its own function despite the fact that this code fragment *means* nothing in isolation.

Conciseness is all well and good, but a developer ought not try to compress the program he is working on. In making a manual Huffman coding of your code, you ensure that all of your hard-won jankiness must be undone in order to edit the code at all. In order to debug it, one-liners must be expanded to print intermediate results, and functional approximations of your obfuscations must be written so *those* can be plugged in and debugged instead.

If you've ever tried to make an internal, complex function into an external one in order to achieve a simple mutation of state with it that you already have access to, you are making the next developer's job harder.

Equally as important, you make it as difficult for the compiler to understand what semantics to preserve as the next guy. Your slow code doesn't need more deduplication and more branches, it needs your fast path to be a straight line.

For this reason, I think that estimation of code quality should emphasize things like consistency of design and local understanding over nearly arbitrary considerations such as whether the same string is reversed in multiple diverging paths of control.

Aot Uber Alles

Despite popular misconception, a good JIT nowadays tends to outperform ahead-of-time compilation. Even so, it's definitely possible to get good AoT performance, as this talk shows. AoT also has other benefits, such as reduced startup time.

Wednesday, August 12, 2015

Safe Free Monads!


I won't be summarizing this one for you, but the ideas are very cool. It brings space-bounds as a form of safety to Free Monad instances.

Free monads are a useful tool for abstraction, separating specification
from interpretation. However, a naive free monad implementation can lead
to stack overflow depending on the evaluation model of the host language.
This paper develops a stack-safe free monad transformer in
PureScript, a strict Haskell-like language compiling to Javascript, and demonstrates
certain applications - a safe implementation of coroutines, and a generic
mechanism for building stack-safe control operators.

Haskell on a Shared-Memory Multiprocessor

The fact that Haskell's parallel programming capabilities work so well is a minor miracle. I didn't quite appreciate it until understanding the spineless tagless G-Machine. Haskell's model of computation, the use of repeated pattern matching to expand and reduce a runtime stack seems particularly poorly suited for parallelism.

Purity does decrease the penalty of duplicated work, but concurrent evaluation of thunks means that the risk of duplicating work is pretty high. The opposite extreme is even more unacceptable. Use of CAS, or locking even, would add an overhead to every evaluation that in sum is just too much.

The mentioned paper details the cleverness behind this feat of engineering.

Performance demands we simply avoid locking entirely in the case that evaluation does not use unsafePerformIO. This leaves us with the problem of mitigating the possible performance losses due to duplicate evaluation.

Blackholing is used to signify that a thunk has been entered, and that it should not be re-entered by another thread. Because of the cost of CAS primitives, a clever scheme reduces how often they are needed. Every once in a while a thread will scan the stack of thunks it is currently evaluating, and will CAS in the black hole. When another thread tries to enter a black hole, it simply blocks on it. If the running thread finds that a thunk on its own stack has already been blackholed, it can truncate the stack and discard items newer than that blackhole. This ensures that, probabilistically, it is the longer-running and thus more expensive evaluations that will not be duplicated.

This can be further optimized by writing in a "grey hole" on entry to the function in a lockless way. This doesn't prevent all work-duplicating races but it does prevent ones that happen after a memory fence. This shortening of the race's time window decreases the amount of duplicated work. Unconditional greyholing does have a performance cost though, requiring a write on entering even the cheapest evaluations.

Saturday, August 8, 2015


The superscalarsort whitepaper by Agarwal is a good read.

It's a pretty novel idea. Modern superscalar processors have the ability to gain massive throughput increases when they can correctly predict branches. The issue is that many conventional sorting algorithms don't have conditionals that are overwhelmingly either true or false. If you picture QuickSort, the optimal case has the conditional (if the current element is larger than the pivot) true 50% of the time. This is the worst-case for modern branch predictors.

Superscalarsort avoids the branch prediction problem by eliminating conditional branching probabilistically. After partitioning the data into buckets, a number of tricks are used to make a radix sort require no comparisons unless there are ties, and there is only a 0.002 probability of a tie given random numbers. The branch-free moves are very amenable to ILP as well.
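To illustrate the flavor of trick (my own toy example, not code from the paper), here is a branch-free partition step: both buckets are written unconditionally and only the correct counter advances, so a compiler can emit conditional-move-style code instead of an unpredictable jump:

```cpp
#include <cstddef>
#include <cstdint>

// One branch-free partition step. Both buckets must be sized to hold all
// n elements, since the "wrong" bucket's slot is written and then reused.
inline void partition_step(uint64_t x, uint64_t pivot,
                           uint64_t* lo, uint64_t* hi,
                           std::size_t& nlo, std::size_t& nhi) {
    bool is_hi = x > pivot;  // 0 or 1; no conditional jump required
    lo[nlo] = x;             // write both buckets unconditionally...
    hi[nhi] = x;
    nlo += !is_hi;           // ...and bump only the matching counter
    nhi += is_hi;
}
```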

I wonder if this 1996 algorithm might find new life on modern GPUs and vector processors. For these massively parallel processors, a conditional with 50/50 probability will see half of the computation threads sit idle in each branch. Conditionals are incredibly expensive in a SIMD world.

Ode to Assert

Oh, assert ();
my oldest friend.

I know quite well where you
started, and my work began.

Your name is cheap and often spoken,
a promise, a guard against promise broken.
When oft I scan and search about,
you accompany my forefathers' many a shout.

You are born of my fear,
our uncertainty,
and the author's doubt.

You accompany many reports of error,
reminding forever,
how haste and castigation come together.

You're an oath overseer awoken,
with the duty to report that oath broken.
Your shout that you ought not to have had to visit me,
is sharp. I worry you may actually resent my company.

I seem not alone,
for yours is the face that cut short a thousand thousand ventures,
builds killed in the cradle from code pristine.

And now at home,
I can read of a thousand more exciting adventures,
to replace you with other devils still unseen.

Forgive them brother,
for they know not what they do,
for they curse themselves,
when they curse at you.

For you are a mirror,
not a jailer,
you do nothing that I do not request,
I am a fool to blame you for my past behest.

Cast armchair blame we may,
but the truth is as clear as day,
it is only when I fault that you wail,
it is only when I fail,
that you fail! ();

Sunday, August 2, 2015


Thought I'd induce some nightmares with Google's deepdream.

It's Worse Than You Think

On why you cannot reason about data races:

Many people reason about lockless/CAS-less code and try to consider how a race condition can be made to be harmless.

If one is lucky, races can lead to duplicated work in the worst case.

Ex: thread A runs x.data = new list(1, 2, 3); x.length = 3; while thread B runs x.data = new list(); x.length = 0;

The problem arises when there are multiple related references that must be changed, or values which rely upon one another. The interleaving of threads can cause random absurdity. "I can still handle a race in that example," you say: "I'll figure out at run-time if the data is inconsistent and wait until it is. If the list is empty and the length field is larger than 0, wait."

Modern superscalar CPUs will dispatch the instructions to execution units in-order but execute them out-of-order. Where a lack of dependency between two instructions can be guessed by the CPU, it doesn't promise any ordering between them. Compilers can behave similarly when aggressively optimizing; this is why modern C is performant. The problem is that across threads, the order of reads and writes can be arbitrarily changed as long as they are correct with respect to others in the same executing thread.

In your racy list example, the length can be set to 0 before the data is set to the new list. That's right, the order that you see on the screen between synchronization primitives means very little.

To really understand this problem, I cannot recommend the following paper from the POPL conference highly enough. It's a pretty good description and proof of what we know the consistency guarantees of x86 concurrency to be.

It's worse than you think.

The quote from this paper that best bursts the Dunning–Kruger effect is:


Memory consistency in multithreaded applications that is not bought with synchronization primitives and memory fences cannot be had for free. Furthermore, the worst failures caused by eliding these primitives will not happen at compile-time or test-time; they will be black swan events dependent on everything from network speed to the ambient solar radiation. 

Wednesday, July 29, 2015

New Blog

I've created a new blog as a way to describe Computer Science whitepapers while I read them. Coverage will vary from topical to the depth of a close reading. I will be pulling from the rapidly growing collection of works in the fields of data structures, lockless concurrency, formal verification, type systems, and compilers of all types.