Sunday, September 13, 2015

Memory Hierarchies

This last week, I covered the Memory Hierarchy lectures. I chose to skip over dynamic optimality, as tree optimality should be covered later in this study, with more of a focus on the notable papers that have been published.

My notes for the Memory Hierarchies lectures are included below:


* Memory Hierarchies I
** External Memory Model
*** CPU, fast cache, slow link to big storage
*** B = block size
*** M = total cache size
*** Note: when you request any memory, you get back a whole block
*** Want to use all items in block before eviction
*** In model we have full control of cache.
*** Care about number of memory transfers
**** <= RAM running time: at most one memory transfer per access
**** >= (# cell probes, i.e. word reads) / B
*** Basic results in this model
**** Scanning: O(ceil(N/B)) memory transfers to read/write N words in order
**** Search trees = B-trees
***** B-trees with branching factor Theta(B) support insert, delete, and predecessor search in O(log_(B+1)(N)) memory transfers
****** Following a pointer down a level costs one memory transfer (a cache jump)
***** Theta(log_(B+1)(N)) memory transfers are required for search in the comparison model, so the above is optimal
***** Each block read reveals where the query fits among B items, which is <= log(B+1) bits of info
**** Sorting: O((N/B) log_(M/B) (N/B)) memory transfers (worked numbers in the sketch after this section)
***** Not achieved by inserting into a B-tree: that would cost O(N log_(B+1)(N)), which is worse
**** Permuting: O(min {N, (N/B) log_(M/B) (N/B)}) memory transfers
***** Rearrange N items into an arbitrary given order
***** 2 ideas: sort the items by target position, or construct a new copy of the array in the right order with one access per item in O(N)
**** Buffer tree: the dynamic version of sorting
***** O((1/B) log_(M/B) (N/B)) amortized memory transfers per operation, with delayed queries
***** Queries are delayed; updates are batched
***** Zero-cost find-min
****** Gives a priority queue: insert and delete run at the normal amortized rate, queries are delayed, and peek is free
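
To make these bounds concrete, here's a quick back-of-the-envelope script. This is my own worked example, not from the lecture, and N, B, and M are made-up (but plausible) parameters:

#+BEGIN_SRC python
import math

# Hypothetical machine: 10^9 words of data, 1024-word blocks, 4M-word cache
N, B, M = 10**9, 1024, 2**22

scan    = math.ceil(N / B)                  # scanning: O(ceil(N/B))
search  = math.log(N, B + 1)                # one B-tree search: O(log_(B+1)(N))
em_sort = (N / B) * math.log(N / B, M / B)  # sorting: O((N/B) log_(M/B)(N/B))
naive   = N * math.log2(N)                  # RAM-style sort if every access missed

print(f"scan:          {scan:.2e} memory transfers")
print(f"B-tree search: {search:.1f} memory transfers")
print(f"EM sort:       {em_sort:.2e} memory transfers")
print(f"naive sort:    {naive:.2e} memory transfers")
#+END_SRC

The external-memory sort comes out several orders of magnitude cheaper than the naive bound here, which is the whole point of the model.
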
** Cache-oblivious Model
*** Newer model, from 1999, out of MIT (Frigo, Leiserson, Prokop, Ramachandran)
*** Similar to the external memory model, but the algorithm doesn't know B or M
*** Can't decide when to evict blocks, and can't preprocess for particular cache parameters
*** The algorithm looks like a RAM algorithm, so it's easier to read
*** Cache replacement is automatic and we don't know the line size; we assume replacement is optimal
**** Justified because LRU or FIFO is very close to optimal: O(1)-competitive given a cache of twice the size of the future-seeing optimal one (see the simulator sketch below)
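
The competitiveness claim is easy to poke at numerically. Here's a minimal LRU simulator (my own illustration; num_lines and block_size are hypothetical parameters) that counts memory transfers, i.e. block misses, for an access trace:

#+BEGIN_SRC python
from collections import OrderedDict

def lru_misses(trace, num_lines, block_size):
    """Count block misses for an address trace under LRU replacement."""
    cache, misses = OrderedDict(), 0
    for addr in trace:
        block = addr // block_size        # the block this access falls in
        if block in cache:
            cache.move_to_end(block)      # refresh recency
        else:
            misses += 1                   # one memory transfer
            cache[block] = True
            if len(cache) > num_lines:
                cache.popitem(last=False) # evict the least recently used
    return misses

# A sequential scan of 100,000 words with B = 64 costs ~N/B transfers:
print(lru_misses(range(100_000), num_lines=256, block_size=64))  # 1563
#+END_SRC

Running your own traces with num_lines doubled is a cheap way to see the resource-augmentation idea in action.
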
*** TODO Cache-oblivious algorithms that are optimal are optimal for all B and M simultaneously, so the structures are nice to use
*** We essentially adapt to different B and M
**** Each level of a real memory hierarchy has its own B and M; a cache-oblivious algorithm is optimized for every level of the adaptive system at once
**** There is no nice formalization of this goal
*** Basic Results in cache-oblivious model
**** Scanning: same as external memory, sequential scan in O(ceil(N/B))
**** Search trees: insert, delete, search in O(log_(B+1)(N)) memory transfers
**** Sorting: O((N/B) log_(M/B) (N/B)) memory transfers
**** Permuting: the min of the two external-memory bounds is impossible
***** Can achieve either of the two old targets, but can't take the better one, because choosing requires knowing B
**** Priority queue: O((1/B) log_(M/B) (N/B)) amortized memory transfers
***** Assumes a "tall cache"
****** M = Omega(B ** (1 + epsilon)) for some epsilon > 0; even an exponent like 1.00000000001 suffices
*** Cache oblivious static search trees
**** Can't use B-trees, don't know what B is.
**** Use balanced BST
**** Algorithm is fixed, all we can do is figure out how to map the nodes sequentially in memory
**** Want number of blocks containing nodes to be relatively small.
**** All the standard orders (pre-order, in-order, post-order) fail; we need the van Emde Boas order of the N nodes (layout sketch after the analysis below)
***** Carve the tree at the middle level, splitting it into triangles
****** A single top triangle, and one bottom triangle for each subtree hanging below that level
***** ~sqrt(N) nodes in each triangle
***** ~sqrt(N) + 1 triangles
***** Recursively lay out each triangle, then concatenate
***** Looks like it's tuned for one constant B, but it is actually tuned for all B's simultaneously
***** van Emde Boas data structure, see lecture 10
**** Analysis: suppose we know B (the analysis may use B even though the algorithm can't)
***** Look at the level of detail straddling B:
****** there exists some level of the recursive splitting such that
******* the little triangles have size <= B, and
******* the triangles one level of detail up have size > B
***** Don't care about the layout inside a little triangle, because it has size at most B
***** Each little triangle is stored contiguously, so it lives in at most 2 blocks (it may straddle a block boundary)
***** How many little triangles do we visit along a root-to-leaf path?
****** Little triangles have height Theta(log(B)): heights roughly halve at each level of detail, and we chose the level where the sizes cross B
****** After (log N) / ((1/2) log(B)) = 2 log_B(N) little triangles, we reach a leaf
****** At <= 2 memory transfers per triangle, that's <= 4 log_B(N) memory transfers
****** Essentially a binary search at the granularity of B
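
Here's a small sketch of the layout itself. It builds the van Emde Boas order for an implicit perfect BST stored with 0-based heap indexing (children of node i are 2i+1 and 2i+2); splitting so that the bottom gets floor(h/2) of the height is one common convention:

#+BEGIN_SRC python
def veb_layout(root, height):
    """Node ids of a heap-indexed perfect BST, in van Emde Boas order."""
    if height == 1:
        return [root]
    bottom_h = height // 2                 # carve at (roughly) the middle level
    top_h = height - bottom_h
    order = veb_layout(root, top_h)        # top triangle first
    first = (root + 1) * 2**top_h - 1      # leftmost descendant at depth top_h
    for k in range(2**top_h):              # then each bottom triangle in turn
        order += veb_layout(first + k, bottom_h)
    return order

print(veb_layout(0, 4))
# [0, 1, 2, 3, 7, 8, 4, 9, 10, 5, 11, 12, 6, 13, 14]
# i.e. the height-2 top triangle, then the four height-2 bottom triangles
#+END_SRC
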
** Cache-oblivious B-trees
**** How to make dynamic
***** Difficult in this world, can't easily make inserts / deletes.
***** Cache-oblivious B-tree
****** 1) Ordered file maintenance: Delay till next lecture [ insert link ]
******* Store N elements in specified order, want to store in an array of linear size
******* Will have "gaps" or blank spots.
******* Updates: delete an element, or insert an element between two given elements, by rearranging an interval of size O(log ** 2 (N)) amortized, via O(1) interleaved scans; that is O(log ** 2 (N) / B) amortized memory transfers
******* Search: can't binary search the array directly with good cache behavior; build a tree in van Emde Boas layout on top of it
******* Each interior node of the tree stores the max of its subtree as its key
****** 2) Use ordered file maintenance data structure to store our keys
****** 3) Update
******* Insert: visit the moved and updated elements' tree nodes in a postorder traversal to recompute the maxes
****** 4) Update analysis
******* Look at level of detail straddling B
******* We visit the leaves first, in postorder
******* Post-order on the bottom two levels of detail = alternating between 2 little triangles = essentially free
******** We update values while moving linearly through the array: sequential cache-line motion
******** Want at least 4 blocks of cache, so that we can move between the two triangles' border elements without missing the cache on each load
******* We pay one block fetch at the beginning and one at the end, but everything in between comes for free
******* We can treat it as amortized: the size of the rearranged interval is O(log ** 2 (N)) amortized, so
******* the number of little triangles visited is O(1 + (log ** 2 (N)) / B) amortized
******** Also number of blocks
******** Also number of memory transfers
******* Levels above the bottom two:
******** Number of nodes that need updating = O ( (log ** 2(N) / B) + log(N))
******** Overall running time per update = O(log_(B+1)(N) + (log ** 2 (N)) / B)
******** Not optimal (doesn't match a B-tree), but the other bounds are good
******** Want to get rid of the log ** 2 (N) / B term
******** For small B, we aren't happy; we can use a technique called indirection
****** 5) Indirection: use the previous data structure on only N / log(N) items
******* Put Theta(log(N)) items in each leaf chunk; only each chunk's min gets stored in the tree
******* Each chunk's size stays between (1/4) log(N) and log(N), splitting or merging when it drifts out of range
******** The top structure changes only when a chunk splits or merges, i.e. once per Theta(log(N)) updates, so its cost is divided by log(N)
******** Update: O((log_(B+1)(N) + log ** 2 (N) / B) / log(N) + log(N) / B + log_(B+1)(N)) = O(log_(B+1)(N)) amortized

* Memory Hierarchies II
** Recall:
O(log_B (N)) insert/delete/search
** Today
- ordered file maintenance
- list labeling
- cache-oblivious priority queue
** Ordered file maintenance
*** Store N items in a specified-order array of size O(N) with O(1)-size gaps subject to inserting/deleting in that order.
**** By moving interval of size O(log ** 2 (N)) amortized via O(1) interleaved scans.
*** Keep the items in an array with gaps; on an update, find the interval around the update position that is neither too full nor too sparse
*** Grow the interval outward until it has enough gaps, then rearrange its items
*** Evenly redistribute the elements over the chosen interval
*** Conceptually overlay a balanced BST on the array (it's never actually stored)
**** Each leaf is a chunk of Theta(log(N)) array slots
**** Each interior node represents the interval formed by its descendant leaves
**** Update:
***** Update leaf O(log(N)) chunk
***** Walk up tree until reaching node/interval within threshold
***** density(node) = (# elements in interval) / (# array slots in interval)
***** density thresholds depend on depth d of node
****** want density >= 1/2 - (1/4) * (d/h)
****** want density <= 3/4 + (1/4) * (d/h)
***** Want to maintain density everywhere
***** Evenly rebalance descendant elements in interval
**** Analysis:
***** Density thresholds get tighter as we move up
***** When we rebalance a node, we put its children *far* within threshold
****** |density - threshold| >= 1 / (4h) = Theta(1 / log(N)) at each child
****** That slack is worth at least one item (in general, a Theta(1/log(N)) fraction of the interval)
***** Since the children are far within threshold, we can charge the rebalance to the updates that push them back out
***** Before this node rebalances again, one of its children must go back out of threshold
***** That takes at least (size of child interval) / Theta(log(N)) updates within the interval
***** Charge the rebalance cost, which is Theta(size of the interval), to those updates
***** Each update gets charged at most once per level of the tree, i.e. h = O(log(N)) times
***** So we have O(log ** 2 (N)) amortized moves per update (toy sketch of the insert path below)
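
As a sanity check on the mechanism, here's a toy version of the insert path: a fixed array with gaps, power-of-two aligned intervals standing in for tree nodes, and the depth-dependent upper density threshold from above. All names are mine; it handles inserts only (no deletes, no growing the array):

#+BEGIN_SRC python
import math

class OrderedFile:
    """Toy ordered-file maintenance: array with gaps + implicit tree."""

    def __init__(self, capacity=32):
        self.a = [None] * capacity
        self.h = int(math.log2(capacity))     # height of the implicit tree

    def _fits(self, lo, hi, depth):
        # upper density threshold from the notes: 3/4 + (1/4)(d/h),
        # loosest at the leaves, tightest at the root
        count = sum(x is not None for x in self.a[lo:hi]) + 1  # +1: new item
        return count / (hi - lo) <= 0.75 + 0.25 * depth / self.h

    def insert_after(self, pos, value):
        """Insert value just after index pos (pos = -1 for the front)."""
        pos = max(pos, 0)
        lo, hi, depth = pos, pos + 1, self.h
        # walk up until the enclosing interval is within threshold
        while not self._fits(lo, hi, depth) and hi - lo < len(self.a):
            size = 2 * (hi - lo)              # double the interval, aligned
            lo = (pos // size) * size
            hi, depth = lo + size, depth - 1
        items = [x for x in self.a[lo:hi] if x is not None]
        rank = sum(1 for x in self.a[lo:pos + 1] if x is not None)
        items.insert(rank, value)
        self.a[lo:hi] = [None] * (hi - lo)
        step = (hi - lo) / len(items)         # evenly redistribute
        for j, x in enumerate(items):
            self.a[lo + int(j * step)] = x

f = OrderedFile()
f.insert_after(-1, "a")
f.insert_after(f.a.index("a"), "c")
f.insert_after(f.a.index("a"), "b")          # squeeze "b" between "a" and "c"
print([x for x in f.a if x is not None])     # ['a', 'b', 'c']
#+END_SRC
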
** List labeling: Maintain a linked list subject to insert/delete
*** Each node stores a label, and label is always monotone
*** Bounds

| Label space                                     | best known time / update |
| (1 + epsilon) N ... N log(N)  ("linear")        | O(log ** 2 (N))          |
| N ** (1 + epsilon) ... N ** O(1) ("polynomial") | Theta(log(N))            |
| 2 ** N ("exponential")                          | O(1)                     |

**** For polynomial label space, Theta(log(N)): reuse the ordered-file machinery with exponentially spaced density thresholds
***** density threshold at depth d ~ 1 / (2 ** d)
***** Big gap between the thresholds of adjacent levels
**** The exponential-space O(1) row is sketched below
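
The exponential-space O(1) row of the table is the easiest to see in code: with label space 2 ** N you can always take midpoints. A minimal sketch (mine):

#+BEGIN_SRC python
def midpoint_label(before, after):
    """Label for a new item inserted between two existing labels."""
    assert after - before >= 2, "label space exhausted"
    return (before + after) // 2

# label space 2**4 = 16; repeatedly insert at the front (the worst case)
labels = [0, 16]
labels.insert(1, midpoint_label(0, 16))   # 8
labels.insert(1, midpoint_label(0, 8))    # 4
labels.insert(1, midpoint_label(0, 4))    # 2
print(labels)                             # [0, 2, 4, 8, 16] -- still monotone
#+END_SRC

Each insert halves the smallest gap, so 2 ** N label space survives about N adversarial inserts with O(1) work each and no relabeling.
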
** List order maintenance problem
*** Maintain a linked list subject to insert / delete node here, and order query
**** Order query: is node x before node y in the list?
**** Two levels: list labeling with label space N ** 2 on the O(N / log(N)) groups = top label
**** List labeling with label space 2 ** Theta(log(N)) inside each group of Theta(log(N)) nodes = bottom label
**** Composite label = (top label, bottom label), compared lexicographically
**** Operations are O(1) at the leaves; a top-label update costs Theta(log(N)) but happens only once per Theta(log(N)) leaf operations, so it's O(1) amortized
***** It's like a skip list with two levels
*** Possible to get O(1) worst-case time; this has been proven (Dietz and Sleator)
** Cache-oblivious Priority Queue
*** Assume a tall cache: M = Omega(B ** (1 + epsilon))
*** Theta(log(log(N))) levels, of sizes N, N ** (2/3), N ** (4/9), ..., down to c = O(1); each level's size is the 2/3 power of the previous
*** The level of size x ** (3/2) has 1 up buffer of size x ** (3/2) and <= x ** (1/2) down buffers, each of size Theta(x), all but the first a constant fraction full
*** Invariants:
**** The minimum lives toward the bottom; the bottom level has constant size, so find/delete-min there is O(1) time
**** Inserts also enter at the bottom; elements trickle up over time
**** Note: each down buffer is internally unordered, but the down buffers are sorted relative to each other
***** All up-buffer items are larger than all down-buffer items at the same level
*** Insert: 
**** 1) Append to the bottom up buffer
**** 2) Swap into bottom down buffers if necessary, is O(1)
**** 3) If the up buffer overflows: Push
***** Generic Push
****** Push x elements into level x ** (3/2)
******* i.e. into the level one size larger than the level (of size x) whose up buffer overflowed
****** 1) Sort X items, cache obliviously
****** 2) Distribute, to restore the invariant: spread the items among the down buffers, and possibly the up buffer, of level x ** (3/2) (toy sketch after the analysis below)
******* Scan elements
******* Visit down buffers in order
******* Down buffers have size limit, so inserting might overflow down buffer.
******* If a down buffer overflows, split it in half.
******* When the number of down buffers overflows, move the last down buffer into the up buffer.
******* If up buffer overflows, push one level up.
****** 3) Recurse
***** Analysis: Push at level x ** (3/2)
****** Ignoring recursion, a push costs O((x/B) log_(M/B) (x/B)) memory transfers
******* That is the cost of sorting the x items anyway
****** Assume all levels of total size up to M are permanently in cache, with zero access cost
****** So assume x ** (3/2) > M; smaller levels are free
****** Tall cache assumption; actually, we want a slightly bigger assumption:
******* Assume x ** (3/2) > M >= B ** 2
******* which implies x > B ** (4/3)
******* and so x / B > 1
****** The distribution step costs O(x / B + x ** (1/2)) memory transfers: a scan of the x items, plus up to one extra transfer per down buffer
****** If x >= B ** 2, then x ** (1/2) <= x / B and we are happy: O(x/B)
******* Else B ** (4/3) <= x <= B ** 2
******** The x ** (1/2) term disappears for these levels, because x ** (1/2) <= B <= M/B, so one block per down buffer fits in cache
******** Each element can only be pushed once per level; summing over the levels for the amortization:
********* we get a super-geometric series, dominated by the top level
********* O((1/B) log_(M/B) (N/B)) amortized memory transfers per operation
****** Delete mentioned, but could not find in posted notes. Seems straightforward though.
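
Here's a heavily simplified toy of the distribute step (step 2 of Push) — my own sketch, not the lecture's implementation. Down buffers are kept as (max key, items) pairs in key order, and splitting sorts the overflowing buffer, which a real implementation would do more carefully:

#+BEGIN_SRC python
def distribute(batch, down, down_cap, max_down, up):
    """Route a batch into down buffers: split on overflow, spill to `up`.

    down: list of {"max": boundary_key, "items": [...]} in key order."""
    for key in sorted(batch):
        target = next((b for b in down if key <= b["max"]), None)
        if target is None:
            up.append(key)                 # larger than every down buffer
            continue
        target["items"].append(key)
        if len(target["items"]) > down_cap:           # overflow: split in half
            items = sorted(target["items"])
            mid = len(items) // 2
            i = down.index(target)
            down[i:i + 1] = [{"max": items[mid - 1], "items": items[:mid]},
                             {"max": target["max"], "items": items[mid:]}]
            if len(down) > max_down:       # too many: last buffer moves up
                up.extend(down.pop()["items"])
    # a real push would now recurse into the next level if `up` overflowed

down = [{"max": 10, "items": [3, 7]}, {"max": 20, "items": [12]}]
up = []
distribute([5, 6, 8, 15, 25], down, down_cap=4, max_down=2, up=up)
print(down)   # two small down buffers covering keys <= 10
print(up)     # [12, 15, 25] waits to be pushed up a level
#+END_SRC
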
* Memory Hierarchies III
** Distribution sweep via lazy funnelsort
** Orthogonal 2d range searching
*** batched, online
** Lazy Funnelsort
*** k-funnel: merges k sorted lists of total size Theta(k ** 3) in O((k ** 3 / B) log_(M/B) (k ** 3 / B) + k) memory transfers
*** Inputs at the bottom; total input size k ** 3
*** Output buffer at the top of the funnel has size k ** 3
*** Fill the body of the funnel with recursive (sqrt(k)-)funnels
*** Buffers between the two layers have total size k ** 2
*** Funnelsort = (N ** (1/3))-way mergesort with (N ** 1/3) funnel as merger.
*** Subroutine
**** Fill(buffer) subroutine (sketch after this list)
**** Want the buffer to be completely full after fill runs
**** Just a regular binary merge of the two children's buffers into this output buffer
**** Case: a child's buffer empties
***** Recursively fill it
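
A minimal sketch of that fill subroutine (class and field names are mine, and the buffer sizes are arbitrary; a real k-funnel sizes and lays out the buffers recursively as described above):

#+BEGIN_SRC python
class Node:
    def __init__(self, left=None, right=None, items=None, capacity=4):
        self.left, self.right = left, right
        self.is_leaf = items is not None
        self.buffer = list(items) if self.is_leaf else []
        self.capacity = capacity

def fill(node):
    """Refill node's output buffer by binary-merging its children."""
    if node.is_leaf:
        return                         # leaf buffers are fixed inputs
    while len(node.buffer) < node.capacity:
        for child in (node.left, node.right):
            if not child.buffer:
                fill(child)            # case: child buffer empties -> recurse
        ready = [c for c in (node.left, node.right) if c.buffer]
        if not ready:
            break                      # both inputs exhausted
        smallest = min(ready, key=lambda c: c.buffer[0])
        node.buffer.append(smallest.buffer.pop(0))   # one binary-merge step

# a 4-funnel: merge 4 sorted lists through two internal nodes and a root
a, b = Node(items=[1, 4, 9]), Node(items=[2, 3, 8])
c, d = Node(items=[0, 5]), Node(items=[6, 7])
root = Node(Node(a, b), Node(c, d), capacity=100)
fill(root)
print(root.buffer)   # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
#+END_SRC
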
** Batched Orthogonal range searching
*** Distribution sweep
**** Funnelsort can be used to divide and conquer on each coordinate
**** Can augment the binary merge to maintain auxiliary information
*** Given N points and N rectangles, we want to know which points lie in each rectangle
**** Batched: the queries (rectangles) are provided up front, together with the points
***** Goal: O(time to sort + output_size / B) memory transfers
**** First count the number of rectangles containing each point; that gives us the output size
***** Why count first: if the output size is big, naive reporting can mess with the cache analysis
***** 1) Sort points and corners by X
***** 2) Divide and conquer with a distribution sweep: partition by x, and make the binary merger (on y) an upward sweep line algorithm
****** Upward sweep line algorithm (sketch of this merge after this section):
******* The left and right children's buffers should be adjacent in memory
******* A rectangle with its left corner in L may horizontally span all of R
******** We must account for that during the merge
******** Maintain C_L = number of currently active rectangles with left corners in L which span R
******** C_R is defined symmetrically
******* When we encounter a point (not a corner), we add C_(other side) to its counter
****** The potentially confusing part: we sort and divide and conquer on x, but the merge inside the mergesort is done on y
***** Now we know the output size
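
Here's a sketch of just that augmented binary merge (my own simplification). Each side is a y-sorted list of events, and rectangles whose corners appear on one side are assumed to horizontally span the other side, which is what the recursion guarantees by the time they meet in this merge:

#+BEGIN_SRC python
from collections import defaultdict

def merge_count(left, right):
    """Count cross-side rectangle containments during a merge by y.

    Events: ("bottom", y) / ("top", y) open and close a spanning
    rectangle on that side; ("point", y, pid) is a query point."""
    counts = defaultdict(int)
    active = {"L": 0, "R": 0}     # C_L and C_R from the notes
    i = j = 0
    while i < len(left) or j < len(right):
        if j >= len(right) or (i < len(left) and left[i][1] <= right[j][1]):
            side, other, ev = "L", "R", left[i]; i += 1
        else:
            side, other, ev = "R", "L", right[j]; j += 1
        if ev[0] == "bottom":
            active[side] += 1
        elif ev[0] == "top":
            active[side] -= 1
        else:                                # a point: rectangles from the
            counts[ev[2]] += active[other]   # other side that are active
    return dict(counts)

left  = [("bottom", 1), ("point", 2, "p1"), ("top", 5)]
right = [("point", 3, "q1"), ("point", 6, "q2")]
print(merge_count(left, right))   # {'p1': 0, 'q1': 1}: q1 is in L's rectangle
#+END_SRC
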
** Online cache-oblivious 2D orthogonal range searching
*** Give me the points in the rectangle
*** Goal: O(log_B(N) + output/B)
**** Regular / 4-sided range search
*** space:
**** 2-sided = O(N)
**** 3-sided = O(N log(N))
**** 4-sided = O(N ( log**2(N) / loglog(N)))
*** Slower than batched, but online
*** Space is the hard part
| Sides | RAM                       | Cache-oblivious                 |
|     2 | O(N)                      | O(N)                            |
|     3 | O(N)                      | O(N log(N))                     |
|     4 | O(N log(N) / loglog(N))   | O(N log ** 2 (N) / loglog(N))   |
*** 2-sided
**** Query: points with x <= given x and y <= given y, i.e. a quarter plane
**** Build a vEB-layout BST on the y coordinates of the points
**** Follow the search to a leaf, then into an array of points (not sorted by x or y); walk right until the x values get too big
***** We want all the points with x smaller than the given x limit
**** Claim: Array has linear size
***** The number of items you need to traverse in array is O(output)
***** Insane performance, very cool
**** Claims:
***** Algorithm finds the right answer
***** The number of scanned points is O(output)
***** Array has size O(N)
**** Density: a query (<= x, <= y) is dense if the number of points in (<= x, *) is at most alpha * the number of points in the answer
***** Dense is the easy case: scan the x-sorted prefix for O(alpha * output / B) memory transfers (sketch below)
***** Otherwise the range is sparse
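
The dense case really is just a prefix scan. A tiny sketch (mine):

#+BEGIN_SRC python
def two_sided_dense(points_by_x, qx, qy):
    """Query (<= qx, <= qy) over points sorted by x, by prefix scan."""
    out = []
    for x, y in points_by_x:     # walk right until x gets too big
        if x > qx:
            break
        if y <= qy:
            out.append((x, y))
    return out

pts = sorted([(1, 5), (2, 1), (3, 7), (4, 2), (8, 3)])
print(two_sided_dense(pts, qx=4, qy=4))   # [(2, 1), (4, 2)]
#+END_SRC

When the query is dense, the scanned prefix has at most alpha times the answer size, so this costs O(alpha * output / B) memory transfers.
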
**** First try at proof:
***** S_0 = all points, sorted by x
****** S_i = the points remaining after step i (those to the right of x_i)
***** If a query's y is very large, it is guaranteed to be dense
***** Consider the first sparse query, i.e. (<= x, <= y) sparse in S_(i-1)
***** Each query is dense in some S_i
***** Continue until S_i has O(1) points
***** This alone won't work: we can only afford O(N) total array space, and storing every S_i is too much
**** Second try
***** Maximize the common suffix
****** (The lecturer wasn't sure what this phrase means; it was in the lecture notes)
***** x_i = max x such that (<= x, <= y_i) is sparse in S_(i-1)
***** Queries above us (larger y) are dense, and likewise to the right of us (larger x)
***** P_(i-1) = S_(i-1) intersect (<= x_i, *)
***** S_i = S_(i-1) intersect ((*, <= y_i) union (> x_i, *))
****** i.e. keep the points below y_i or to the right of x_i, and discard the rest
***** We can't store the S_i's because they're too big
***** We'll store the P_i's and the final S_i; we can get down to something of constant size before storing S_i
**** Analysis: claim the array has linear total size
***** Space: |P_(i-1) intersect S_i| <= (1/alpha) * |P_(i-1)|
****** The left-hand side is exactly the size of the answer to the query (x_i, y_i), and |P_(i-1)| is the number of points in (<= x_i, *); the inequality is just sparsity
****** We will use this to build a charging scheme that proves linear size
***** Charge storing P_(i-1) to the discarded points, P_(i-1) \ S_i
****** These are the points we discard from our region. That's good: if we charge them now, we'll never charge them again in the future
****** The charge is spread over at least (1 - 1/alpha) * |P_(i-1)| elements, many more than the query returns, so each is charged O(1)
***** Therefore the array P_0, P_1, ... has linear total size
***** Since we store the P_i's, not the S_i's, a query has to look at some extra points
***** Query cost: a decreasing geometric series
****** By the charging claim the P_i's shrink geometrically, so we're searching in successively smaller collections
**** Correctness
***** The very next P_i starts over from x = -infinity and proceeds forward, so no needed point is skipped
***** Once you exceed the x threshold, you can stop scanning
*** 3-sided
**** Binary tree on x; each node has 2 augmented structures
***** One structure handles 2-sided queries opening to the left, the other the mirror-image queries opening to the right
***** For a query [x1, x2] x (<= y), find the least common ancestor of x1 and x2 and do one 2-sided query on each side
**** Build it in van Emde Boas layout
**** O(N log(N)) space; the same indirection trick as before avoids paying another log factor
*** 4-sided
**** Could use exactly the same trick, lose another log factor in space
**** Want to try to do slightly better
**** Do the same thing as 3-sided, but not on a binary tree: use sqrt(log(N))-ary trees
**** Imagine contracting the vEB tree into chunks, each with sqrt(log(N)) children
**** Store a BST on y at each of those chunk nodes
***** Lets us avoid querying leaves between endpoint leaves
***** Use 3-sided range query.
***** BST on y has O(N) size
***** O(N log(N)) size for the top-level tree, from the previous (3-sided) structure
**** Since the height of the tree is 2 log(N) / loglog(N), we get that last little bit of space improvement: O(N log ** 2 (N) / loglog(N)) overall
