Thursday, August 27, 2015

Mutability and Persistent Data Structures: Retroactivity

Intro: Mutation-based Optimization of Persistent Data Structures

The beginning of MIT's 6.851 covers persistent data structures, but it takes an interesting angle. Rather than enumerating one persistent data structure after another, it describes the spectrum of persistence, ranging from partially to fully persistent structures. I'm going to report on the part of the lecture most novel relative to the other material in my study: the use of mutation to optimize partially persistent structures.
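Before the full write-up, here is a minimal sketch in C++ of the simplest mutation-friendly idea in this space, with names of my own choosing (FatField, write, read are not the course's terminology): a "fat" field keeps a version-stamped history, so every old version stays readable while only the newest version is writable, which is exactly what partial persistence asks for. This is the naive unbounded-history variant, not the amortized node-splitting construction the lecture builds toward.

#include <cstdio>
#include <utility>
#include <vector>

template <typename T>
class FatField {
    // (version stamp, value) pairs, with stamps strictly increasing.
    std::vector<std::pair<int, T>> history_;

public:
    explicit FatField(T initial) { history_.emplace_back(0, std::move(initial)); }

    // Mutation is only allowed "at the end": it creates a new, newer version.
    int write(T value) {
        int version = history_.back().first + 1;
        history_.emplace_back(version, std::move(value));
        return version;
    }

    // Reads may target any version: return the last write at or before it.
    const T& read(int version) const {
        for (auto it = history_.rbegin(); it != history_.rend(); ++it)
            if (it->first <= version) return it->second;
        return history_.front().second;  // version predates every write
    }
};

int main() {
    FatField<int> field(10);
    int v1 = field.write(20);
    int v2 = field.write(30);
    std::printf("%d %d %d\n", field.read(0), field.read(v1), field.read(v2));  // 10 20 30
}

The mutation is the point: instead of copying a path of nodes per update, the structure is destructively appended to, and the version stamps are what keep the old views intact.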

Monday, August 24, 2015

Algorithms Independent Study

I am beginning an independent study in advanced data structures, with a focus on persistent and lockless concurrent data structures.

The syllabus will be largely taken from the MIT course on the same topic:

http://courses.csail.mit.edu/6.851/spring14/lectures/

It will also make use of Okasaki's Purely Functional Data Structures (http://www.amazon.com/Purely-Functional-Structures-Chris-Okasaki/dp/0521663504).

A combination of these resources and a large collection of whitepapers will provide me with a corpus to study.

The record of my progress will be a series of write-up posts with the algo_study tag on this blog. This post carries that tag; you can click it below to see subsequent posts in this study.

I will also speak at my supervising professor's group. I'll try to get a record of those talks onto this blog.

Saturday, August 15, 2015

The Bugs We Have to Kill

This article is a pretty nice read, I think, because it manages to unite exploit development, Floyd-Hoare correctness proofs, and parsing.

It makes really good points. Almost all of the correctness-assistance tools that I've seen rely on preconditions holding before a function call. Sadly, if an exploit author can use return-oriented programming, those preconditions can be broken at run time even when they have been proven to hold at compile time.

Thursday, August 13, 2015

On False Sharing and DRY

I know many new developers who have learned a single metric for code quality: the amount of repeated code. The acronym DRY (Don't Repeat Yourself) is repeated frequently.

After having worked on code that was far too DRY in my webdev days, I unhesitatingly believe that this is a poor attribute to emphasize on its own. What do I mean by too DRY?

- Two diverging control-flow paths that happen to share the same way of computing a value are made to merge just to deduplicate that computation.
- A function that happens to perform one desired side effect while computing others gets called for that side effect alone, with the return value discarded or the other effects somehow "reversed".
- An operation repeated across two or three branches, such as OR-ing a value with a bitmask and printing it, gets broken out into its own function even though that fragment *means* nothing in isolation; a sketch of this follows.
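A contrived sketch of that last bullet in C++, with every name made up: the shared helper exists only because the two branches happened to contain the same two lines, and it has no meaning you could state without reading its callers.

#include <cstdint>
#include <cstdio>

// This helper has no nameable meaning of its own; it exists only because
// the error path and the status path coincidentally share these two lines.
static void or_and_print(uint32_t value, uint32_t mask) {
    value |= mask;
    std::printf("%08x\n", static_cast<unsigned>(value));
}

void report_error(uint32_t flags)  { or_and_print(flags, 0x80000000u); /* ...rest of the error path... */ }
void report_status(uint32_t flags) { or_and_print(flags, 0x00000001u); /* ...rest of the status path... */ }

int main() {
    report_error(0x0000beef);
    report_status(0x0000beef);
}

Inlining those two lines back into each path would leave each branch readable on its own and free to diverge later without anything having to be un-shared first.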

Conciseness is all well and good, but a developer ought not to try to compress the program they are working on. Hand-rolling a Huffman coding of your code means that all of that hard-won jankiness must be undone before the code can be edited at all. To debug it, one-liners must be expanded to print intermediate results, and functional approximations of your obfuscations must be written so that *those* can be plugged in and debugged instead.

If you've ever turned a complex internal function into an external one just so you could reuse it for a simple mutation of state you already had access to, you made the next developer's job harder.

Equally important, you make it as hard for the compiler to work out which semantics it must preserve as for the next developer. Your slow code doesn't need more deduplication and more branches; it needs its fast path to be a straight line.

For this reason, I think that estimation of code quality should emphasize things like consistency of design and local understanding over nearly arbitrary considerations such as whether the same string is reversed in multiple diverging paths of control.

AoT Uber Alles

Despite popular misconception, a good JIT nowadays tends to outperform ahead-of-time compilation. Even so, it's definitely possible to get good AoT performance, as this talk shows, and AoT has benefits of its own, such as reduced startup time.

https://www.youtube.com/watch?v=Xybzyv8qbOc&index=15&list=PLX8CzqL3ArzUo2dtMurvpUTAaujPMeuuU

Wednesday, August 12, 2015

Safe Free Monads!

Paper:
http://functorial.com/stack-safety-for-free/index.pdf

I won't be summarizing this one for you, but the ideas are very cool. It brings stack-space bounds, as a form of safety, to free monad implementations.

"""
Abstract
Free monads are a useful tool for abstraction, separating specification
from interpretation. However, a naive free monad implementation can lead
to stack overflow depending on the evaluation model of the host language.
This paper develops a stack-safe free monad transformer in
PureScript, a strict Haskell-like language compiling to Javascript, and demonstrates
certain applications - a safe implementation of coroutines, and a generic
mechanism for building stack-safe control operators.
 """

Haskell on a Shared-Memory Multiprocessor

The fact that Haskell's parallel programming capabilities work so well is a minor miracle. I didn't quite appreciate it until I understood the Spineless Tagless G-machine. Haskell's model of computation, repeated pattern matching that expands and reduces a runtime stack, seems particularly poorly suited to parallelism.

Purity does lower the penalty of duplicated work, but concurrent evaluation of thunks means the risk of duplicating work is pretty high. The opposite extreme, synchronizing every thunk, is even less acceptable: using CAS, or even locking, would add an overhead to every evaluation that in aggregate is just too much.

The mentioned paper (http://research.microsoft.com/pubs/67424/2005-haskell.pdf) details the cleverness behind this feat of engineering.

Performance demands that we simply avoid locking entirely in the case where evaluation does not use unsafePerformIO. This leaves us with the problem of mitigating the possible performance losses due to duplicate evaluation.

Blackholing is used to signify that a thunk has been entered and should not be re-entered by another thread. Because blackholing requires CAS primitives, a clever scheme reduces how often it happens. Every once in a while, a thread scans the stack of thunks it is currently evaluating and CASes a black hole into each of them. When another thread tries to enter a blackholed thunk, it simply blocks on it. If, during that scan, the running thread discovers that one of its thunks has already been blackholed by someone else, it can truncate its stack and discard the entries newer than that black hole. This ensures that, probabilistically, it is the longer-running and thus more expensive evaluations that will not be duplicated.

This can be further optimized by writing a "grey hole" into the thunk on entry, in a lockless way. That doesn't prevent all work-duplicating races, but it does prevent the ones that would happen after a memory fence; shrinking the race's time window reduces the amount of duplicated work. Unconditional greyholing does have a performance cost, though: it requires a write on entry to even the cheapest evaluations.
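To make the claiming scheme concrete, here is a toy illustration using C++ atomics. It is not GHC's runtime, and every name in it (Thunk, GREYHOLE, force) is invented here: a cheap relaxed store marks the thunk as entered without excluding a racing thread, and a later CAS hardens that claim into a real black hole that latecomers block on instead of duplicating the work.

#include <atomic>
#include <cstdio>
#include <thread>

enum Status { UNEVALUATED, GREYHOLE, BLACKHOLE, EVALUATED };

struct Thunk {
    std::atomic<int> status{UNEVALUATED};
    std::atomic<int> value{0};

    int force() {
        // Fast path: somebody already finished this thunk.
        if (status.load(std::memory_order_acquire) == EVALUATED)
            return value.load(std::memory_order_relaxed);

        // "Grey hole": a plain relaxed store on entry. It is cheap and lockless,
        // so it does not exclude a racing thread; it only narrows the window in
        // which two threads can both start evaluating.
        status.store(GREYHOLE, std::memory_order_relaxed);

        // Later (the runtime would do this during a periodic stack scan) the
        // claim is hardened with a CAS into a real black hole. A thread that
        // loses this race blocks rather than duplicating a long evaluation.
        int expected = GREYHOLE;
        if (!status.compare_exchange_strong(expected, BLACKHOLE,
                                            std::memory_order_acq_rel)) {
            while (status.load(std::memory_order_acquire) != EVALUATED) { /* spin */ }
            return value.load(std::memory_order_relaxed);
        }

        value.store(41 + 1, std::memory_order_relaxed);  // stand-in for the real evaluation
        status.store(EVALUATED, std::memory_order_release);
        return value.load(std::memory_order_relaxed);
    }
};

int main() {
    Thunk t;
    std::thread a([&] { std::printf("a: %d\n", t.force()); });
    std::thread b([&] { std::printf("b: %d\n", t.force()); });
    a.join();
    b.join();
}

Duplicated evaluation is still possible here by design; the bet, as in the paper, is that the long-running and therefore expensive evaluations are the ones most likely to have been properly black-holed by the time a second thread arrives.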



Saturday, August 8, 2015

Superscalarsort

The superscalarsort whitepaper by Agarwal is a good read.

It's a pretty novel idea. Modern superscalar processors can gain massive throughput when they correctly predict branches. The issue is that many conventional sorting algorithms don't have conditionals that are overwhelmingly true or overwhelmingly false. Picture QuickSort: in the optimal case, the conditional (is the current element larger than the pivot?) is true 50% of the time. That is the worst case for modern branch predictors.

Superscalarsort sidesteps the branch-prediction problem by eliminating conditional branching almost entirely. After partitioning the data into buckets, a number of tricks make the radix sort require no comparisons unless there are ties, and with random numbers a tie occurs with probability 0.002. The branch-free moves are also very amenable to instruction-level parallelism.
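This is not the paper's algorithm, just the flavor of the trick in a C++ sketch with made-up names: the result of a comparison is used as an index rather than as a branch, so elements are distributed into buckets with no data-dependent conditional jump for the predictor to guess.

#include <cstdio>
#include <vector>

// Distribute the input into two buckets without a data-dependent branch:
// the comparison result (0 or 1) selects both the destination and its cursor.
void distribute(const std::vector<int>& in, int pivot,
                std::vector<int>& low, std::vector<int>& high) {
    low.assign(in.size(), 0);
    high.assign(in.size(), 0);
    int* out[2] = {low.data(), high.data()};
    std::size_t idx[2] = {0, 0};
    for (int x : in) {
        std::size_t which = static_cast<std::size_t>(x >= pivot);  // 0 or 1, no if/else
        out[which][idx[which]++] = x;                              // indexed store instead of a branch
    }
    low.resize(idx[0]);
    high.resize(idx[1]);
}

int main() {
    std::vector<int> in{5, 1, 9, 3, 7}, low, high;
    distribute(in, 5, low, high);
    std::printf("%zu below the pivot, %zu at or above\n", low.size(), high.size());
}

The only branch left in the loop is its trip count, which is perfectly predictable; the data-dependent decision never reaches the branch predictor at all.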

I wonder whether this 1996 algorithm might find new life on modern GPUs and vector processors. On those massively parallel machines, a 50/50 conditional leaves half of the computation threads sitting idle during each branch; conditionals are incredibly expensive in a SIMD world.

Ode to Assert

Oh, assert ();
my oldest friend.

I know quite well where you
started, and my work began.

Your name is cheap and often spoken,
a promise, a guard against promise broken.
When oft I scan and search about,
you accompany my forefathers' many a shout.

You are born of my fear,
our uncertainty,
and the author's doubt.

You accompany many reports of error,
reminding forever,
how haste and castigation come together.

You're an oath overseer awoken,
with the duty to report that oath broken.
Your shout that you ought not to have had to visit me,
is sharp. I worry you may actually resent my company.

I seem not alone,
for yours is the face that cut short a thousand thousand ventures,
builds killed in the cradle from code pristine.

And now at home,
I can read of a thousand more exciting adventures,
to replace you with other devils still unseen.

Forgive them brother,
for they know not what they do,
for they curse themselves,
when they curse at you.

For you are a mirror,
not a jailer,
you do nothing that I do not request,
I am a fool to blame you for my past behest.

Cast armchair blame we may,
but the truth is as clear as day,
it is only when I fault that you wail,
it is only when I fail,
that you fail! ();


Sunday, August 2, 2015

Deepdream

Thought I'd induce some nightmares with Google's deepdream.



It's Worse Than You Think

On why you cannot reason about data races:

Many people reasoning about lockless/CAS-less code try to work out how a race condition can be made harmless.

If one is lucky, a race leads to nothing worse than some duplicated work.

Ex (two threads racing on the same x):

// Thread A:
x.data = new list(1, 2, 3); x.length = 3;
// Thread B:
x.data = new list(); x.length = 0;


The problem arises when there are multiple related references that must be changed together, or values that rely upon one another. The interleaving of threads can cause random absurdity. "I can still handle a race in that example," you say. "I'll figure out at run time whether the data is inconsistent and wait until it's consistent again. If the list is empty but the length field is greater than 0, wait."

Modern superscalar CPUs fetch and decode instructions in order but execute them out of order. Wherever the CPU can guess that two instructions don't depend on each other, it promises no particular ordering between them. Compilers behave similarly when optimizing aggressively; this is part of why modern C is fast. The problem is that, across threads, the order of reads and writes can be changed arbitrarily, as long as each change is consistent with the other reads and writes in the same executing thread.

In your racy list example, the length can become visible as 0 before the data is set to the new list. That's right: between synchronization primitives, the order you see on the screen means very little.
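As a sketch of what it actually takes to make that run-time check sound, here is one way to buy the ordering back with C++11 atomics; the Box type and its field names are mine, and this shows only the publication half of the problem.

#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

struct Box {
    std::vector<int>* data{nullptr};
    std::atomic<std::size_t> length{0};
};

void writer(Box& x) {
    static std::vector<int> three{1, 2, 3};
    x.data = &three;                               // write the list first...
    x.length.store(3, std::memory_order_release);  // ...then publish the length
}

void reader(Box& x) {
    // The acquire load pairs with the writer's release store: if we observe
    // length == 3, we are guaranteed to also observe the matching data pointer.
    // With plain, non-atomic reads and writes there is no such guarantee, and
    // "the length says 3 but the list isn't there yet" is a perfectly legal outcome.
    if (x.length.load(std::memory_order_acquire) == 3)
        std::printf("first element: %d\n", (*x.data)[0]);
    else
        std::printf("not published yet\n");
}

int main() {
    Box x;
    std::thread w(writer, std::ref(x));
    std::thread r(reader, std::ref(x));
    w.join();
    r.join();
}

Even on x86, whose hardware ordering is comparatively strong, the compiler remains free to reorder the plain-store version, so the explicit ordering is not optional.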

To really understand this problem, I cannot recommend the following POPL paper highly enough. It's a good description, with proofs, of what we know the consistency guarantees of x86 concurrency to be.

It's worse than you think.

The quote from this paper that best bursts the Dunning–Kruger effect is:




Paper: http://www.cl.cam.ac.uk/~pes20/weakmemory/popl09.pdf

Memory consistency in multithreaded applications cannot be had for free; it has to be bought with synchronization primitives and memory fences. Worse, the failures caused by eliding those primitives will not show up at compile time or test time: they will be black swan events, dependent on everything from network speed to ambient solar radiation.