Intro: Mutation-based Optimization of Persistent Data Structures

The beginning of MIT's 6.851 focuses on persistent data structures, but it takes an interesting angle. Rather than enumerating a number of persistent data structures, it describes the spectrum of persistence, ranging from partially to fully persistent data structures. I'm going to report on the part of this lecture that is most novel compared to the other material in my study: the use of mutation to optimize partially persistent structures.
The first lecture moves straight to a coverage of ways to optimize persistent data structures using mutation. This falls under the category of partial persistence, where previous versions can be queried but only the "most recent" copy can be modified. This differs from what you'll see in Okasaki's book or many other sources for persistent data structures in that it allows changes to the underlying data structure while ensuring that the data structure appears persistent and immutable to API consumers.
The main strategy presented for partial persistence is to use a mutation log. This is a good introduction for some of the patterns of update tracking used by retroactive data structures. The mutation log strategy boils down to storing the modifications in multiple ways rather than directly modifying the data structure.
For each object, there is a read-only region containing the object's original data. Whenever a modification is made, it is inserted into the "modification log" along with a record of the version that was modified. Reading a field then requires a traversal of the modification log. This stays O(1) because log overflow is handled by recreating the object: when the modification log overflows, the object is copied, and each field in the new object is initialized to its value after all of the logged updates are applied.
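A minimal sketch of this in Python (the class name, the field layout, and the log bound are all illustrative; the sketch also assumes writes arrive in nondecreasing version order, and it omits the incoming-reference forwarding discussed next):

```python
class PNode:
    """Partially persistent node using a bounded modification log.
    Illustrative sketch: a real scheme also forwards incoming pointers."""
    MAX_MODS = 4  # hard cap on the log before copy-on-overflow

    def __init__(self, fields):
        self.fields = dict(fields)  # read-only original data
        self.mods = []              # log of (version, field, value)

    def read(self, field, version):
        # Walk the log newest-first; fall back to the original value.
        for v, f, val in reversed(self.mods):
            if f == field and v <= version:
                return val
        return self.fields[field]

    def write(self, field, value, version):
        # Returns the node holding the update: self, or a fresh copy on overflow.
        if len(self.mods) < self.MAX_MODS:
            self.mods.append((version, field, value))
            return self
        # Overflow: materialize every field at `version`, then start a new node
        # with an empty log.
        snapshot = {f: self.read(f, version) for f in self.fields}
        snapshot[field] = value
        return PNode(snapshot)
```

Old versions stay readable through the log: `read('x', v)` answers as of version `v`, while writes mutate in place until the log fills.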
Copy-on-overflow introduces a problem with incoming references. If nothing is done, they'll point to an old copy of the data structure. The solution is to make sure that all of the incoming references are set to the new object. It is for this reason that this strategy of partial persistence requires that objects track incoming references, which also requires a hard cap on the number of incoming references per object. If the number exceeds this amount, a copy of the referred-to node is made. For applications which take advantage of pointer equality, modification-log-based partial persistence can present a significant hardship.
What happens when the "persistent" part of the data structure is actually taken advantage of? That is, what happens when version A receives multiple updates, creating two diverging views of the data structure? This is technically unsupported by partially persistent data structures, but a modification to our bookkeeping can handle it. If we make each forward and back reference carry a version, then we know which copy each incoming reference expects to point to.
The modification log will contain the history from these two new different "most recent" versions until an overflow happens. On an overflow, one copy will be made for each diverging version and the references will be forwarded to one version or another, based on which version the incoming reference holds.
This, I fear, is where you risk sacrificing the practical benefits of purely functional data structures. The modification log is a common-case optimization, amortizing the time and space cost of copying on modification. Unfortunately, it takes the version bookkeeping out of the hands of the garbage collector: until the modification log overflows, garbage data remains in memory. It's not hard to imagine a pathological worst case for space usage where almost every object has an almost-full modification log.
There is also the fact that back-reference updates and this manual memory management could race in thread-unsafe ways; it is probably easy to use such mutation-based structures incorrectly.
The MIT course covers this (http://courses.csail.mit.edu/6.851/spring14/scribe/lec2.pdf), but I'm going to use the professor's whitepaper on the topic (http://erikdemaine.org/papers/Retroactive_TALG/paper.pdf), as it's a better in-depth description of what is novel about the method.
What do we really mean when we call a data structure partially or fully retroactive? A partially retroactive data structure is one with a limitation analogous to partial persistence: you may only query the state of the structure at time "now." A fully retroactive data structure, in contrast, is one where you may query the state at any time. Both classifications allow modification at any time.
What does this even mean, to modify a structure in the past? Recall the time-travel trope in popular culture, where the past may be changed in a way that alters the future. In a similar way, picture taking a queue which has had 10 enqueues followed by one dequeue, and removing all but one of the enqueues. If those 9 other enqueues never happened, you'd very likely have a different resulting structure when the algorithm terminates.
A retroactive data structure is not a persistent data structure. A persistent data structure says that an alteration at a given time in the past creates an entirely separate world from that point on, irreconcilable with the original. A retroactive data structure says that an alteration at a time in the past changes the state of the world we're in right now. A good analogy is the "rebase" command in the git version control system: when we alter a commit far in the past, every following commit must be reapplied so that the change is properly integrated into the history.
Where is this useful beyond novelty? The paper offers a list of uses. One of the more interesting ones is:
Tainted Sources. In a situation where data enters a system from various automated devices, if one device is found to have malfunctioned, all of its transactions are invalid over a period of time and must be retroactively removed. For example, in a situation where many temperature readings from various weather stations are reported to a central computer, if one weather station's sensors are discovered to be intermittently malfunctioning, we wish to remove all of the sensor readings from the malfunctioning station because they are no longer reliable. Secondary effects of the readings, such as averages, historical highs and lows, along with weather predictions must be retroactively changed.

Personally, I believe that time-travelling debugging in the case of difficult-to-reproduce bugs is a compelling enough case for this technology.
The bad news is that there is no general way to take a data structure and to make it efficiently fully-retroactive. Time is not another dimension to the data, because decisions made at later times depend on the data structure's state at a given time. Invariants must be maintained, and this frequently cannot be done by simply applying transformations to one piece of data and then combining it with the current state of the world.
Consider some contrived application which applies a noninvertible operation to a single cell, and writes the return value back into that cell in a loop. If you want to be able to change the starting value of the cell, or the value at an arbitrary point in time, the data structure would need to change: Value(t=1) depends on Value(t=0). You'd need to go from O(1) space to O(m) space, where m is the number of updates, since the state at every update must be saved. This is unreasonable in many applications.
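To make the O(m)-space argument concrete, here is a tiny sketch, with SHA-256 standing in for an arbitrary noninvertible operation (all names are illustrative):

```python
import hashlib

def step(x: bytes) -> bytes:
    # SHA-256 is effectively noninvertible: step(x) cannot be undone.
    return hashlib.sha256(x).digest()

def replay(seed: bytes, m: int) -> list:
    """Keep every intermediate state, O(m) space, because a retroactive
    change at step i invalidates states i+1..m."""
    states = [seed]
    for _ in range(m):
        states.append(step(states[-1]))
    return states

def retro_update(states: list, i: int, new_state: bytes) -> list:
    # Replace state i, then recompute the dependent suffix: O(m - i) time.
    out = states[:i] + [new_state]
    for _ in range(len(states) - i - 1):
        out.append(step(out[-1]))
    return out
```

Changing the seed retroactively forces a full replay; nothing short of storing every state makes an arbitrary past point reachable.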
The good news is that data structures can be made efficient on a case-by-case basis by taking advantage of the patterns of data use. Also, partially retroactive structures can exploit commutativity and invertibility for great time savings. Adding or removing data in the past then has the same effect as doing it in the present, and removal of past data can be achieved by simply applying the operation's inverse.
Imagine a data structure which maintains a running sum. Retroactively adding or removing a value is as simple as adding it to or subtracting it from the sum. Being able to query the sum at a given time, however, would require ordering all updates, which puts us suddenly in O(m) space territory. Undoing a number of sums would require traversing this list of updates; in general, we're suddenly at O(m * t_inv) access time, where t_inv is the cost of the operation's inverse. In short, partial retroactivity is easier in the general case.
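A minimal sketch of such a partially retroactive running sum in Python (the class and method names are my own, not from the paper):

```python
class RetroSum:
    """Partially retroactive running sum. Addition is commutative and
    invertible, so past inserts and deletes apply directly to the present."""
    def __init__(self):
        self.total = 0

    def insert(self, value, t=None):
        # The time t is accepted but irrelevant to the present total:
        # commutativity means order of application doesn't matter.
        self.total += value

    def delete(self, value, t=None):
        # Undo a past insert by applying the inverse operation.
        self.total -= value

    def query_now(self):
        return self.total
```

Note that only `query_now` is supported; querying the sum as of a past time is exactly the extra capability that would force O(m) storage of the update history.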
Full retroactivity has a "sweet spot" when the problem domain is decomposable search. Decomposable search problems are ones in which a query can be split into a query on subsets of the original data, and the result of any two sub-queries can be combined to arrive at the final result. In short, they have commutativity and inversions. In pseudocode:
S = A union B
find(pred, S) = find(pred, A union B) = combine(find(pred, A), find(pred, B))

This necessarily implies that the collection is an unordered set. We impose the restriction that the lifetime of an object in a data structure starts with its creation, and that it alternates between insertion and deletion (trivially true, since most structures don't support double-deletes or double-inserts). Let's consider how to represent this in a fully-retroactive way. The running-time proofs are in the paper; I will sketch out the heart of the implementation.
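A quick self-contained check of the decomposability property in Python (the predicate, `find_min`, and `combine` here are hypothetical stand-ins for whatever query the structure supports):

```python
# Decomposable search: a query over A ∪ B equals the combination of the
# queries over A and over B. Here the "search" finds the minimum element
# satisfying a predicate, and combine() merges two partial answers.
def find_min(pred, s):
    matches = [x for x in s if pred(x)]
    return min(matches) if matches else None

def combine(a, b):
    if a is None:
        return b
    if b is None:
        return a
    return min(a, b)

A = {3, 8, 1}
B = {5, 2}
is_odd = lambda x: x % 2 == 1

# The query on the union matches the combined per-subset queries.
assert find_min(is_odd, A | B) == combine(find_min(is_odd, A),
                                          find_min(is_odd, B)) == 1
```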
The heart begins with another data structure: the segment tree. Consider a timeline with the lifetime of each object represented as an interval. This can be efficiently stored as a balanced binary search tree by making each lifetime interval a leaf node. The interior nodes then maintain the invariant that they represent the time intervals their children span. The height of this tree is O(log(m)), where m is the total number of updates performed. Interior nodes contain data structures allowing efficient search on the specific attributes the decomposable search checks for.
A modification to this tree then requires at most O(log(m)) changes to these structures. A retroactive query then simply involves querying the data at each of the O(log(m)) sets and then combining these results. It's easy to see how this is significantly more efficient than using an O(m) roll-back and replay for each change, and a worst-case O(m) roll-back to query the structure.
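Here is a rough Python sketch of the idea under simplifying assumptions (discrete time slots 0..m-1, plain lists as the per-node structures, and the decomposable query passed in as `find`/`combine` callbacks; this is illustrative, not the paper's construction):

```python
class LifetimeSegTree:
    """Segment tree over m time slots. Each element's lifetime interval is
    stored in O(log m) canonical nodes; a query at time t touches one node
    per level and combines their partial answers."""
    def __init__(self, m):
        self.m = m
        self.items = [[] for _ in range(4 * m)]  # per-node element sets

    def insert(self, l, r, value, node=1, lo=0, hi=None):
        # Store `value`, alive over the half-open interval [l, r),
        # in the canonical nodes covering that interval.
        if hi is None:
            hi = self.m
        if r <= lo or hi <= l:
            return
        if l <= lo and hi <= r:
            self.items[node].append(value)  # node fully covered: canonical
            return
        mid = (lo + hi) // 2
        self.insert(l, r, value, 2 * node, lo, mid)
        self.insert(l, r, value, 2 * node + 1, mid, hi)

    def query(self, t, find, combine):
        # Root-to-leaf walk: the O(log m) node sets on the path partition
        # the elements alive at time t; combine their partial answers.
        node, lo, hi, ans = 1, 0, self.m, None
        while True:
            ans = combine(ans, find(self.items[node]))
            if hi - lo == 1:
                return ans
            mid = (lo + hi) // 2
            if t < mid:
                node, hi = 2 * node, mid
            else:
                node, lo = 2 * node + 1, mid
```

With `find` returning the set of matching elements at a node and `combine` taking a union, `query(t, ...)` recovers exactly the elements alive at time t.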
Let's consider an implementation of an efficient, simple, fully-retroactive queue. A queue has the rule that a removal must return the oldest object inserted which has not yet been removed. Looking at the data structures mentioned so far, it's easy to see how a specialization could work for queues. Consider a separate order-statistic tree (a tree that assigns a "rank" to each value and allows querying the value with the ith-smallest rank) for the collection of enqueues and another for the collection of dequeues. The computed rank for the dequeue tree is the time; for the enqueue tree, it is the number of elements inserted before it. The enqueues thus form an "append-only" queue, and dequeuing becomes a matter of finding the right offset into that queue.
You can find this offset by querying the dequeue tree for the number of dequeues before the time in question. Call this d. You would then query for the value in the enqueue tree with rank d+1. This would be the head of the queue after d dequeues. Stop and consider how much more efficient this is than the naive roll-back implementation.
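Those two steps can be sketched in Python. Sorted lists with `bisect` stand in for the order-statistic trees here, so these operations are O(m) worst case rather than the paper's O(log m); all names are my own:

```python
import bisect

class RetroQueue:
    """Simplified fully-retroactive queue: two time-ordered sequences,
    one of enqueues and one of dequeues."""
    def __init__(self):
        self.enq_times = []   # sorted enqueue times
        self.enq_values = []  # values, parallel to enq_times
        self.deq_times = []   # sorted dequeue times

    def insert_enqueue(self, t, value):
        # Retroactively perform enqueue(value) at time t.
        i = bisect.bisect_right(self.enq_times, t)
        self.enq_times.insert(i, t)
        self.enq_values.insert(i, value)

    def insert_dequeue(self, t):
        # Retroactively perform a dequeue at time t.
        bisect.insort(self.deq_times, t)

    def front(self, t):
        """Head of the queue just after time t."""
        d = bisect.bisect_right(self.deq_times, t)  # dequeues up to time t
        k = bisect.bisect_right(self.enq_times, t)  # enqueues up to time t
        if d >= k:
            return None  # queue empty at time t
        # After d dequeues, the head is the (d+1)-th enqueue in time order.
        return self.enq_values[d]
```

A retroactive enqueue before every existing operation shifts which element the existing dequeues consumed, and `front` reflects that automatically, with no roll-back and replay.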
The paper then gets O(1) current-version access "for free" by saying:
Using balanced search trees supporting updates in worst-case constant time [Fleischer 1996], and by maintaining pointers into the trees to the current front and back of the queues, updates and queries at the current time can be supported in O(1) time.
I invite you to read the paper for a coverage of the retroactive implementations of other common data structures. Notes for my readings of the priority queue section are below the rest of this post.
Throughout the discussion of retroactivity so far, the bookkeeping has stopped at the data structure's own invariants. This significantly decreases the usefulness of these structures, because values returned by past queries may become semantically inconsistent with the results of future queries. If you retroactively insert an element at time t=0, every query between t=0 and now technically returned an incorrect result. In situations with cumulative state, this is problematic.
The solution is to store the queries themselves in the data structure, so that invalidated return values can be found and processed one by one: the queries are rerun at the times these "errors" occurred, and the API-consumer code that depends on the query results is re-executed.
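A toy sketch of this query-logging idea in Python, using a running sum as the underlying structure (all names are hypothetical, and a real implementation would need far more careful bookkeeping):

```python
class AuditedRetroSum:
    """Retroactive sum that also logs queries, so a retroactive update can
    find invalidated answers and notify the consumer code that used them."""
    def __init__(self):
        self.updates = []   # list of (time, delta)
        self.queries = []   # list of (time, last_answer, on_change callback)

    def _sum_at(self, t):
        return sum(d for (u, d) in self.updates if u <= t)

    def insert(self, t, delta):
        self.updates.append((t, delta))
        # Rerun every logged query at or after t; report changed answers.
        for i, (qt, old, cb) in enumerate(self.queries):
            if qt >= t:
                new = self._sum_at(qt)
                if new != old:
                    self.queries[i] = (qt, new, cb)
                    cb(qt, old, new)

    def query(self, t, on_change=lambda *args: None):
        ans = self._sum_at(t)
        self.queries.append((t, ans, on_change))
        return ans
```

The callback is the hook for re-executing dependent consumer code: when a retroactive insert lands before a logged query, the consumer learns its old answer was wrong.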
I've only found one retroactive algorithm implementation intended for a general audience. It targets Python, and can be found here: https://github.com/csvoss/retroactive

Conclusion: The place to go from here, if you are more interested in retroactive data structures, is definitely the canonical whitepaper linked above. The paper "Non-oblivious Retroactive Data Structures" by Acar, Blelloch, and Tangwongsan is important if you intend to implement the query-rerunning feature. I also recommend analyzing the supporting data structures you intend to use; the performance of many of these algorithms depends heavily on the segment tree implementation.
Priority Queue Notes:
-- Priority Queues
--- delete-min is what makes it non-commutative
--- <Insert timeline diagram for priority queue>
--- Maintaining deleted elements expensive, instead handle inserts
--- Bridges are kind of "coordination points". A time t is a bridge if the collection of elements alive at time t is a subset of the elements available now.
--- Separate the chaotic chain reactions
-- Need 3 data structures to use bridge:
--- Q_now is a balanced binary search tree, changed once per update
--- Balanced BST where leaves are the insertions ordered by time
--- A third balanced BST of updates, ordered by time, with a sign showing whether each is part of the set of things available now or a delete-min.
-- How to use these data structures:
1) On insert at time t, find bridge t' right before that.
2) When the third BST has a prefix sum of 0 (explain weights), that update is a bridge. We can find a bridge in O(log m).
3) Use the BBST of insertions to find the maximum element not already in Q_now. O(log m)
4) Update the data structures after making the update (ex: for a priority queue, if the inserted element is larger than any removed element, then change what was removed; else just add to Q_now).
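To make the bridge definition in these notes concrete, here is a brute-force check in Python (intuition only, quadratic time; the paper finds bridges in O(log m) via the signed-prefix-sum trick noted above):

```python
import heapq

def bridges(updates):
    """Brute-force check of the bridge definition: a time t is a bridge if
    every element alive at t is still alive now. `updates` is a list of
    ('insert', x) / ('delete-min',) operations in time order."""
    def alive_after(k):
        # Replay the first k updates and return the surviving elements.
        heap, alive = [], set()
        for op in updates[:k]:
            if op[0] == 'insert':
                heapq.heappush(heap, op[1])
                alive.add(op[1])
            elif heap:
                alive.discard(heapq.heappop(heap))
        return alive
    now = alive_after(len(updates))
    return [k for k in range(len(updates) + 1) if alive_after(k) <= now]
```

For `[insert 3, insert 1, delete-min, insert 2]`, the moment just after `insert 1` is not a bridge: 1 is alive then but has since been deleted by the delete-min.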