Tuesday, October 27, 2015

Hash Consing with Modern Processors, Your In-Cache Cache

If you have
a * c == b * c (modulo some prime, with c non-zero)
then you know that a must be equal to b.
So problem solved, right? Well... You carry this messy prime number everywhere. It makes everything more complicated. For example, you cannot work with all 32-bit numbers. Both 0x7FFFFFFF and 0 are zero. We have 0x80000000 == 1 and so forth.
What if there were prime numbers that are powers of two? There is no such thing... when using regular arithmetic... But there are "prime numbers" (called "irreducible polynomials" by mathematicians) that act a bit like they are powers of two when using carry-less multiplications.
With carry-less multiplications, it is possible to define a modulo operation such that
modulo(a "*" c) == modulo(b "*" c)
implies a == b. And it works with all 32-bit or 64-bit integers.
That's very nice, isn't it?
Up until recently, however, this was mostly useful for cryptographers because computing the carry-less multiplications was slow and difficult.
However, something happened in the last few years. All major CPU vendors have introduced fast carry-less multiplications in their hardware (Intel, AMD, ARM, POWER). What is more, the latest Intel processors (Haswell and better) have a carry-less multiplication that is basically as fast as a regular multiplication, except maybe for a higher latency.
.....

That's right: CLHash is much faster than competing alternatives as soon as your strings are a bit large. It can hash 8 input bytes per CPU cycle. You are more likely to run out of memory bandwidth than to wait for this hash function to complete. As far as we can tell, it might be the fastest 64-bit universal hash family on recent Intel processors, by quite a margin.
http://lemire.me/blog/2015/10/26/crazily-fast-hashing-with-carry-less-multiplications/
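
To make the quoted idea a bit more concrete, here is a minimal software sketch of a carry-less product and the polynomial "modulo" it enables. This is not Lemire's CLHash code; the degree-32 constant below is purely illustrative (for the a == b implication to hold the modulus must be irreducible over GF(2)), and hardware instructions such as PCLMULQDQ compute the product in a few cycles rather than with a loop.

    import Data.Bits (shiftL, shiftR, testBit, xor)
    import Data.Word (Word32, Word64)

    -- Carry-less product: XOR together shifted copies of a, one for each
    -- set bit of b. This is multiplication of polynomials over GF(2).
    clmul32 :: Word32 -> Word32 -> Word64
    clmul32 a b = foldl step 0 [0 .. 31]
      where
        a64 = fromIntegral a :: Word64
        step acc i
          | testBit b i = acc `xor` (a64 `shiftL` i)
          | otherwise   = acc

    -- Remainder modulo a degree-32 polynomial p (bit 32 of p must be set),
    -- by long division over GF(2): clear the high bits one at a time.
    polyMod :: Word64 -> Word64 -> Word32
    polyMod v p = fromIntegral (foldl step v [63, 62 .. 32])
      where
        step acc i
          | testBit acc i = acc `xor` (p `shiftL` (i - 32))
          | otherwise     = acc

    main :: IO ()
    main = print (polyMod (clmul32 0xDEADBEEF 0x9E3779B9) poly)
      where
        -- x^32 + x^7 + x^3 + x^2 + 1; chosen only for this example,
        -- not the constant CLHash actually uses.
        poly = 0x10000008D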


This great blog post by Daniel Lemire inspired me to put together something about the idea of hash consing in purely functional data structures. It is an optimization that doesn't get enough attention in literature designed for non-experts.

The idea behind hash consing is that in a system with high data throughput (which describes most functional languages) one will frequently re-create substructures that are identical to live or recently-freed substructures. For example, a program often ends up calling two different helper-library functions that each need to reverse the same list. For a large linked list, those duplicate copies can occupy a lot of space.
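
As a tiny illustration (the two "helpers" here are stand-ins, and GHC's StableName machinery is used only to observe heap identity): the two results below are equal as values but are two separate structures on the heap, which is exactly the duplication hash consing targets.

    import System.Mem.StableName (makeStableName, eqStableName)

    main :: IO ()
    main = do
      let xs = [1 .. 100000 :: Int]
          -- Two different "helper libraries" each build a reversed copy.
          a = reverse xs
          b = foldl (flip (:)) [] xs
      print (a == b)              -- True: structurally identical
      sa <- makeStableName a
      sb <- makeStableName b
      print (eqStableName sa sb)  -- False: two distinct heap objects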

In a functional language, a program's control flow is usually dictated by the traversal of a data structure. Computing the first n powers of a given prime, for example, is equivalent to traversing a list of length n containing those powers. In a lazy language, the (n-1)-th cell of that list initially holds a "thunk", an unevaluated closure that produces the value when forced. Each thunk need only be evaluated once; but by building multiple lists from the same base prime we create an arbitrary number of independent copies of the (n-1)-th power, and evaluating all of those thunks duplicates work unnecessarily.
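
A sketch of that situation (the helper functions are invented for the example): each caller below builds its own lazy list of powers, so the same thunks get forced once per list rather than once overall.

    -- Lazy list of powers of p: each cell holds a thunk that, when forced,
    -- multiplies the previous power by p.
    powers :: Integer -> [Integer]
    powers p = iterate (* p) 1

    -- Two unrelated "helper" functions that each traverse their own copy
    -- of the list; the (n-1)-th power is recomputed for each of them.
    sumOfPowers :: Int -> Integer -> Integer
    sumOfPowers n p = sum (take n (powers p))

    largestPower :: Int -> Integer -> Integer
    largestPower n p = last (take n (powers p))

    main :: IO ()
    main = print (sumOfPowers 20 7, largestPower 20 7)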

There is an entire class of algorithmic and compiler micro-optimizations to avoid this overhead. A fairly general and simple method is to make the allocator de-duplicating. In a pure language, the result of an allocation will not be mutated, so two requests to create the same substructure can be given the same reference without any alteration of program semantics or control flow. This is done by hashing the structure to be created and checking whether it has already been built; if so, the old reference is returned. The result is a pervasive cache threaded through every level of the application.
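
Here is a minimal sketch of such a de-duplicating allocator for a cons-list of Ints, assuming a single table passed around explicitly (the names HList, Table, and cons are invented for the example; a real implementation would key the table on a hash and hold entries through weak references, as discussed below).

    import Data.IORef (IORef, newIORef, readIORef, writeIORef)
    import qualified Data.Map.Strict as M

    -- A hash-consed list: every node carries a unique id handed out by the
    -- allocator, and structurally identical nodes share one heap object.
    data HList = Nil | Cons !Int !Int HList   -- Cons nodeId head tail

    listId :: HList -> Int
    listId Nil          = -1
    listId (Cons i _ _) = i

    -- Table from (head, id of tail) to the node already allocated for it,
    -- plus a counter for fresh ids.
    data Table = Table !Int (M.Map (Int, Int) HList)

    newTable :: IO (IORef Table)
    newTable = newIORef (Table 0 M.empty)

    -- De-duplicating allocation: return the existing node when an identical
    -- one has been built before, otherwise allocate and remember it.
    cons :: IORef Table -> Int -> HList -> IO HList
    cons ref x xs = do
      Table n m <- readIORef ref
      let key = (x, listId xs)
      case M.lookup key m of
        Just node -> return node
        Nothing   -> do
          let node = Cons n x xs
          writeIORef ref (Table (n + 1) (M.insert key node m))
          return node

    main :: IO ()
    main = do
      ref <- newTable
      a <- cons ref 1 =<< cons ref 2 =<< cons ref 3 Nil
      b <- cons ref 1 =<< cons ref 2 =<< cons ref 3 Nil
      -- Deep structural equality is now a single integer comparison.
      print (listId a == listId b)   -- True: a and b are the same node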

While simple to explain, this optimization has a lot of nuances. We need to free structures at some point. Let's consider the simple case where we don't keep any substructures that are unreachable from the program: it's only a de-duplication of live objects. When the last reference to a substructure goes out of scope, the cache must be updated. This is typically implemented with "weak references" that the garbage collector is allowed to null out.

We must choose either to place a finalizer on every object (expensive) or to allow false positives in the cache. A false positive is relatively cheap when encountered during a query, since we would have needed to find the slot to fill with the new reference either way. False positives do require a "garbage collection" step for the cache itself, though, where the weak references that have become null are removed from the structure.
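
A sketch of that variant in GHC terms (Node and the cache layout are invented for the example; System.Mem.Weak is the real API): lookups tolerate dead entries by falling through to re-allocation, and a separate sweep removes entries whose weak pointers have been nulled.

    import Control.Monad (filterM)
    import Data.IORef (IORef, newIORef, readIORef, writeIORef, modifyIORef')
    import Data.Maybe (isJust)
    import qualified Data.Map.Strict as M
    import System.Mem.Weak (Weak, mkWeakPtr, deRefWeak)

    data Node = Node Int   -- stand-in for a real hash-consed structure

    type Cache = IORef (M.Map (Int, Int) (Weak Node))

    -- Lookup that tolerates false positives: a key may outlive its node,
    -- in which case deRefWeak returns Nothing and we fall through to
    -- allocating a fresh node. No per-object finalizer is installed.
    lookupOrInsert :: Cache -> (Int, Int) -> IO Node -> IO Node
    lookupOrInsert ref key make = do
      m   <- readIORef ref
      hit <- maybe (return Nothing) deRefWeak (M.lookup key m)
      case hit of
        Just node -> return node           -- live entry: share it
        Nothing   -> do                    -- miss or dead entry: rebuild
          node <- make
          w    <- mkWeakPtr node Nothing
          modifyIORef' ref (M.insert key w)
          return node

    -- Occasional "garbage collection" of the cache itself: drop entries
    -- whose weak pointers the runtime's GC has already nulled.
    sweep :: Cache -> IO ()
    sweep ref = do
      m    <- readIORef ref
      live <- filterM (\(_, w) -> isJust <$> deRefWeak w) (M.toList m)
      writeIORef ref (M.fromList live)

    main :: IO ()
    main = do
      ref <- newIORef M.empty
      a   <- lookupOrInsert ref (1, -1) (return (Node 1))
      b   <- lookupOrInsert ref (1, -1) (return (Node 1))  -- shared with a
      sweep ref
      case (a, b) of (Node x, Node y) -> print (x, y)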

The classic reason to avoid hash consing is that computing a hash that is unlikely to collide has historically been expensive. The above article brought to mind the fact that modern techniques have made hashing much cheaper. Patrascu and Thorup have shown that simple tabulation hashing performs well enough to be game-changing. Alongside hardware primitives like carry-less multiplication, I believe hashing has become cheap enough to make revisiting techniques such as hash consing worthwhile. In the worst case, it adds a constant overhead per allocation and provides no gain. In the best case, it performs optimal deduplication of both computation and intermediate data structures.
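
For reference, a sketch of simple tabulation hashing for 32-bit keys (the tables here are filled with a toy xorshift sequence just to keep the example dependency-free; real code would use properly random words).

    import Data.Array (Array, listArray, (!))
    import Data.Bits (shiftL, shiftR, xor, (.&.))
    import Data.Word (Word32, Word64)

    -- Four tables of 256 random 64-bit words, one table per key byte.
    tables :: Array (Int, Int) Word64
    tables = listArray ((0, 0), (3, 255)) (tail (iterate xorshift seed))
      where
        seed = 0x9E3779B97F4A7C15
        xorshift x0 =
          let x1 = x0 `xor` (x0 `shiftL` 13)
              x2 = x1 `xor` (x1 `shiftR` 7)
          in  x2 `xor` (x2 `shiftL` 17)

    -- Simple tabulation hash: look each byte of the key up in its own
    -- table and XOR the four results together.
    tabHash :: Word32 -> Word64
    tabHash k = foldl step 0 [0 .. 3]
      where
        step acc i =
          let byte = fromIntegral ((k `shiftR` (8 * i)) .&. 0xFF)
          in  acc `xor` (tables ! (i, byte))

    main :: IO ()
    main = print (tabHash 0xDEADBEEF, tabHash 0xDEADBEEE)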

Hash consing has interesting cache trade-offs. It can keep a data structure small, which improves overall cache behavior through lower memory pressure. On the other hand, objects allocated at the same time may no longer be near each other in memory. An interesting idea in the vein of Demaine's Tango Trees would be to move the deduplicated structure to the most recent allocation site, and move it back on free. For deep structures this has a high cost, so heuristics are necessary. Alternatively, the technique could be used to return shared objects only when they are likely to still be in the CPU cache. A cache-oblivious hash consing technique seems like an easy win, preventing allocation from dirtying cache lines where possible.
