My notes for the three Memory Hierarchies lectures have been included below:

* Memory Hierarchies I
** External Memory Model
*** CPU with a fast cache, connected by a slow link to big storage
*** B = block size
*** M = total cache size
*** Note: when you request any memory, you get back a whole block
*** Want to use all items in a block before it is evicted
*** In this model we have full control of the cache.
*** We care about the number of memory transfers
**** <= RAM running time (at most one memory transfer per access)
**** >= (number of cell probes / word reads) / B
*** Basic results in this model
**** Scanning: O(ceil(N/B)) to read/write N words in order
**** Search trees = B-trees
***** B-trees with branching factor Theta(B) support insert, delete, and predecessor search in O(log_(B+1)(N)) memory transfers
****** since following a pointer between levels is one cache jump
***** Theta(log_(B+1)(N)) memory transfers is the lower bound for search in the comparison model, so the above is optimal
***** Each block read reveals where the query fits among B items, i.e. <= log(B+1) bits of information.
**** Sorting: O((N/B) log_(M/B)(N/B)) memory transfers
***** Does not follow from using B-trees.
**** Permuting: O(min{N, (N/B) log_(M/B)(N/B)})
***** Permute items arbitrarily
***** 2 ideas, take the better: sort the items by target position, or construct a new copy of the array in the right order in O(N) memory transfers.
**** Buffer tree: dynamic version of sorting.
***** O((1/B) log_(M/B)(N/B)) amortized memory transfers per operation, with delayed queries.
***** Delayed queries, batched updates
***** Free find-min
****** Gives a priority queue: insert/delete at the cost above; queries are delayed, but peeking at the min is free
** Cache-oblivious Model
*** Newer, from 1999, from MIT
*** Similar to the external memory model, but the algorithm doesn't know B or M
*** Can't decide when to evict things; everything must be arranged beforehand.
*** The algorithm looks like a RAM algorithm, so it's easier to read
*** Cache replacement is automatic, and we don't know the line size. We assume replacement is optimal.
**** LRU or FIFO come very close to optimal: O(1)-competitive with the future-seeing optimal policy, given a cache of twice the size.
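The scanning bound is easy to see empirically. Below is a minimal sketch (not from the lecture) that counts memory transfers under an LRU cache of M words with block size B, comparing a sequential scan against random access; the `LRUCache` class and all parameter values are illustrative assumptions.

```python
import random
from collections import OrderedDict

class LRUCache:
    """Counts memory transfers for an LRU cache of M words with block size B."""
    def __init__(self, B, M):
        self.B = B
        self.lines = M // B            # number of cache lines
        self.cached = OrderedDict()    # block id -> None, in LRU order
        self.transfers = 0

    def access(self, addr):
        block = addr // self.B         # any access fetches a whole block
        if block in self.cached:
            self.cached.move_to_end(block)
        else:
            self.transfers += 1        # one memory transfer per missed block
            self.cached[block] = None
            if len(self.cached) > self.lines:
                self.cached.popitem(last=False)  # evict least recently used

B, M, N = 64, 4096, 100_000

scan = LRUCache(B, M)
for i in range(N):                     # sequential scan: ceil(N/B) transfers
    scan.access(i)

random.seed(0)
rand = LRUCache(B, M)
for _ in range(N):                     # random access: roughly one per access
    rand.access(random.randrange(N))

print(scan.transfers, rand.transfers)
```

Running this shows the scan costing exactly ceil(N/B) transfers while random access misses almost every time, which is the whole point of blocking.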
*** Cache-oblivious algorithms are optimal for all B and M at once, so the structures are nice to use.
*** We essentially adapt to different B and M
**** Each level of the memory hierarchy has a different B and M; one algorithm is optimized for every level simultaneously.
**** No nice formalization of this goal exists
*** Basic results in the cache-oblivious model
**** Scanning: same as external memory, a sequential scan costs O(ceil(N/B))
**** Search trees: insert, delete, search in O(log_(B+1)(N)) memory transfers
**** Sorting: O((N/B) log_(M/B)(N/B)) memory transfers
**** Permuting: the min bound is impossible
***** Either of the two old bounds is achievable, but we can't choose the better one, because we don't know B.
**** Priority queue: O((1/B) log_(M/B)(N/B)) amortized memory transfers
***** Assumes a "tall cache"
****** M = Omega(B^(1+epsilon)) for some constant epsilon > 0 (epsilon can be tiny, e.g. 0.00000000001)
*** Cache-oblivious static search trees
**** Can't use B-trees: we don't know what B is.
**** Use a balanced BST
**** The algorithm is fixed; all we can do is choose how to map the nodes sequentially into memory
**** Want the number of blocks containing visited nodes to be small.
**** All the standard orders (pre, in, post) fail; we need the van Emde Boas order of the N nodes
***** Carve the tree at the middle level of height, splitting it into triangles
****** a single triangle on top, and one for each subtree hanging below that level.
***** ~sqrt(N) nodes in each triangle.
***** ~sqrt(N) + 1 triangles.
***** Recursively lay out each triangle, then concatenate.
***** Looks like it's tuned for one constant B = C, but it is actually tuned for all B's.
***** Same recursion as the van Emde Boas data structure, see lecture 10
**** Analysis: pretend we know B
***** Look at the level of detail straddling B:
****** there exists some level of the recursive split such that
****** the little triangles have size <= B,
****** while the triangles one level coarser have size > B.
***** We don't care about the layout inside a little triangle, because it has size <= B,
***** so each little triangle lives in at most 2 blocks.
***** How many little triangles do we visit along a root-to-leaf path?
****** O(log_B(N)) of them.
****** Each little triangle has height Theta(log(B)), since we chose the level of detail where the sizes straddle B.
****** After (log N)/((1/2) log B) little triangles we reach a leaf,
****** so <= 4 log_B(N) memory transfers (at most 2 blocks per triangle).
****** Essentially a B-tree search without knowing B.
** Cache-oblivious B-trees
*** How to make the static tree dynamic?
**** Difficult in this world; inserts/deletes aren't easy.
**** Cache-oblivious B-tree construction:
***** 1) Ordered file maintenance: details delayed till next lecture [ insert link ]
****** Store N elements in a specified order in an array of linear size,
****** which will have "gaps" (blank slots).
****** Updates: delete an element, or insert an element between two given ones, by moving the elements in an interval of size O(log^2(N)) amortized via O(1) scans, i.e. O(log^2(N)/B) amortized memory transfers.
****** Search: can't binary search the raw array cache-efficiently; must use the van Emde Boas layout to get the right cache bound.
****** Build a tree over the array with each interior node keyed by the max of its subtree.
***** 2) Use the ordered file maintenance structure to store our keys, with that tree on top.
***** 3) Update:
****** fix just the moved and updated nodes, in a postorder traversal, to recompute the maxes.
***** 4) Update analysis:
****** Look at the level of detail straddling B.
****** We visit the leaves first, in postorder.
****** Postorder alternates between 2 adjacent little triangles, which is essentially free:
******* we update a value and move linearly through the array, so the cache-line motion is sequential;
******* we want at least 4 blocks of cache, so that we can move between the triangles' blocks without missing cache on each load.
****** We pay one fetch at the beginning and one at the end, but everything in between is free,
****** and we can amortize: the rewritten interval has size O(log^2(N)) amortized, so
****** the number of little triangles touched is O(1 + log^2(N)/B) amortized.
******* This is also the number of blocks,
******* and hence the number of memory transfers.
****** Levels above the bottom two:
******* number of nodes needing updates = O(log^2(N)/B + log(N))
******* Overall running time per update = O(log_(B+1)(N) + log^2(N)/B)
******* Not optimal (doesn't match a B-tree), but the other bounds are good.
******* Want to get rid of the log^2(N)/B term; for small B we aren't happy. We can use a technique called indirection.
***** 5) Use the previous data structure on N/log(N) items
****** Put Theta(log(N)) items in each leaf chunk; each chunk's min is stored in the tree.
****** Chunk size stays between (1/4) log(N) and log(N).
****** Top-structure updates happen only once per Theta(log(N)) leaf updates, amortizing their cost down by a log(N) factor:
****** O(log_(B+1)(N) + (log_(B+1)(N) + log^2(N)/B)/log(N) + log(N)/B) = O(log_(B+1)(N)) amortized memory transfers per update.
* Memory Hierarchies II
** Recall: O(log_(B+1)(N)) insert/delete/search
** Today: ordered file maintenance, list labeling, cache-oblivious priority queues
** Ordered file maintenance
*** Store N items in a specified order in an array of size O(N), with O(1)-size gaps, subject to insertion/deletion at a given position.
**** Achieved by moving an interval of size O(log^2(N)) amortized via O(1) interleaved scans.
*** When you insert an item, find an interval containing it that is neither too full nor too sparse:
*** grow the interval outward until there are enough gaps, then rearrange,
*** redistributing the items evenly over the new interval.
*** Implemented with a balanced BST on top!
**** Each leaf holds a chunk of Theta(log(N)) elements
**** Each interior node represents an interval of the array
**** Update:
***** update the leaf's O(log(N))-size chunk,
***** then walk up the tree until reaching a node whose interval is within its density threshold.
***** density(node) = (# elements in interval) / (# array slots in interval)
***** Density thresholds depend on the depth d of the node (h = tree height):
****** want density >= 1/2 - (1/4)(d/h)
****** want density <= 3/4 + (1/4)(d/h)
***** Want to maintain these densities everywhere.
***** Rebalance by evenly redistributing the descendant elements over the node's interval.
**** Analysis:
***** Density thresholds get tighter as we move up.
***** When we rebalance a node, we put its children *far* within their thresholds:
****** |density - threshold| >= 1/(4h) = Theta(1/log(N)),
****** a slack of at least one item.
***** Being far within threshold means we can charge the rebalance to later updates:
***** before this node rebalances again, one of its children must go back out of threshold,
***** which takes Omega((size of interval)/log(N)) updates.
***** Charge the rebalance cost, Theta(size of the interval), to those updates within the interval.
***** Each update gets charged h = O(log(N)) times,
***** so we have O(log^2(N)) amortized time per update.
** List labeling: maintain a linked list subject to insert/delete
*** Each node stores a label, and labels are always monotone along the list
*** Bounds:
| Label space | Best known time per update |
| (1+epsilon)N ... N log N | O(log^2(N)) |
| N^(1+epsilon) ... N^O(1) | Theta(log(N)) |
| 2^N | O(1) |
**** For polynomial label space: Theta(log(N)), using density thresholds of the form density <= 1/2^d
***** The exponentially bigger gap in label space is what buys the speedup.
** List order maintenance problem
*** Maintain a linked list subject to insert/delete a node here, plus order queries
**** Order query: is node x before node y in the list?
**** Use list labeling with label space N^2 for the top level (groups) = top label
**** and list labeling with label space 2^Theta(log(N)) within each group of Theta(log(N)) nodes = bottom label
**** Composite label = (top label, bottom label)
**** Operations are O(1) at the leaves; relabeling inside the top tree costs O(log(N)), but it is O(1) amortized
***** It is like a skip list with two levels
*** O(1) worst-case time is also possible (a known result)
** Cache-oblivious Priority Queue
*** Assume a tall cache: M = Omega(B^(1+epsilon))
*** Theta(log(log(N))) levels of sizes N, N^(2/3), N^(4/9), ..., down to c = O(1)
*** Level x^(3/2) has 1 up buffer of size x^(3/2) and <= x^(1/2) down buffers, each of size Theta(x), all but the first a constant fraction full.
*** Invariants:
**** The min is toward the bottom. The bottom level has constant size, so find-min/delete-min there is O(1) time.
**** Inserts also happen there; items trickle up.
**** Note: each down buffer is unordered internally, but the down buffers are sorted relative to one another,
***** and all up-buffer items are larger than all down-buffer items.
*** Insert:
**** 1) Append to the bottom level's up buffer.
**** 2) Swap into the bottom down buffers if necessary: O(1).
**** 3) If the up buffer overflows: Push.
***** Generic Push:
****** push x elements into level x^(3/2)
******* (the level one larger than the level x they overflowed from)
****** 1) Sort the x items cache-obliviously.
****** 2) Distribute, fixing the invariant: spread among the down buffers and possibly the up buffer of level x^(3/2).
******* Scan the elements,
******* visiting the down buffers in order.
******* Down buffers have a size limit, so inserting might overflow a down buffer;
******* if a down buffer overflows, split it in half.
******* If the number of down buffers overflows, move the last down buffer into the up buffer.
******* If the up buffer overflows, push one level up.
****** 3) Recurse.
***** Analysis: Push at level x^(3/2)
****** Ignoring recursion, costs O((x/B) log_(M/B)(x/B)) memory transfers
******* (this is just the cost of sorting).
****** Assume all levels of total size up to M are permanently in cache and have 0 access cost.
****** So assume x^(3/2) > M.
****** The tall-cache assumption; actually, we want a slightly bigger assumption:
******* assume x^(3/2) > M >= B^2,
******* so x > B^(4/3)
******* and x/B > 1.
****** The distribution step costs O(x/B + x^(1/2)) memory transfers.
****** If x >= B^2, then x^(1/2) <= x/B and we are happy: O(x/B).
****** Else B^(4/3) <= x <= B^2:
******* the x^(1/2) term disappears for this level, because x^(1/2) < B <= M/B means the down buffers' active blocks all fit in cache.
******* Basically each push triggers at most one push per level; sum over the levels for the amortization:
******** the level sizes form a super-geometric series,
******** giving O((1/B) log_(M/B)(N/B)) amortized memory transfers per operation.
****** Delete was mentioned, but is not in the posted notes; it is a symmetric Pull and seems straightforward.
* Memory Hierarchies III
** Distribution sweep via lazy funnelsort
** Orthogonal 2D range searching
*** batched and online
** Lazy Funnelsort
*** k-funnel: merges k sorted lists of total size Theta(k^3) in O((k^3/B) log_(M/B)(k^3/B) + k) memory transfers
*** Inputs at the bottom, total size k^3
*** The output buffer at the top of the funnel has size k^3
*** Fill the body of the funnel with recursive sqrt(k)-funnels
*** The buffers between the two layers have total size k^2
*** Funnelsort = N^(1/3)-way mergesort with an N^(1/3)-funnel as the merger.
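The funnel mechanics can be sketched in miniature. This is an illustrative toy, not the real data structure: the `Funnel` class, its fixed buffer capacity of 4, and the `funnel_merge` driver are my inventions, and a real k-funnel sizes its buffers (k^2 total between layers) and lays itself out recursively to achieve the stated memory-transfer bound.

```python
class Funnel:
    """Toy lazy funnel: a binary tree of mergers with small output buffers."""
    def __init__(self, lists):
        if len(lists) == 1:
            self.src = list(lists[0])          # leaf streams one sorted list
            self.left = self.right = None
        else:
            mid = len(lists) // 2
            self.src = None
            self.left = Funnel(lists[:mid])
            self.right = Funnel(lists[mid:])
        self.buf = []                          # output buffer, filled lazily
        self.cap = 4                           # illustrative capacity

    def done(self):
        """True when this subtree can produce no more elements."""
        if self.buf:
            return False
        if self.src is not None:
            return not self.src
        return self.left.done() and self.right.done()

    def fill(self):
        """Fill this node's buffer by binary-merging the children's buffers."""
        if self.src is not None:               # leaf: just stream the input
            n = self.cap - len(self.buf)
            self.buf.extend(self.src[:n])
            del self.src[:n]
            return
        while len(self.buf) < self.cap:
            for child in (self.left, self.right):
                if not child.buf and not child.done():
                    child.fill()               # a child buffer emptied: refill
            l, r = self.left.buf, self.right.buf
            if not l and not r:                # both children exhausted
                return
            if l and (not r or l[0] <= r[0]):
                self.buf.append(l.pop(0))
            else:
                self.buf.append(r.pop(0))

def funnel_merge(lists):
    """Merge sorted lists by repeatedly filling and draining the root."""
    root = Funnel(lists)
    out = []
    while not root.done():
        root.fill()
        out.extend(root.buf)
        root.buf = []
    return out

print(funnel_merge([[1, 4, 7], [2, 5, 8], [3, 6, 9]]))
# → [1, 2, 3, 4, 5, 6, 7, 8, 9]
```

The lazy part is that a merger runs only when its buffer is drained, which is what lets the analysis charge each element O(1) buffer traversals per funnel level.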
*** Subroutine: Fill(buffer)
**** Want the buffer to be completely full after Fill runs.
**** It is just a regular binary merge of the two children's buffers into the output buffer.
**** Case: a child buffer empties:
***** recursively Fill it.
** Batched orthogonal range searching
*** Distribution sweep
**** Funnelsort can be used to divide and conquer on each coordinate,
**** augmenting the binary merge to maintain auxiliary information.
*** Given N points and N rectangles, we want to know which points are in each rectangle
**** Batched: the queries are provided up front along with the points.
***** Goal: O(time to sort + output_size/B) memory transfers.
**** First count the number of rectangles containing each point; this gives us the output size.
***** Problem: if the output size is big, naive reporting can mess with the cache.
***** 1) Sort the points and rectangle corners by x.
***** 2) Divide and conquer on x with a distribution sweep, where the binary merger = an upward sweep-line algorithm.
****** Upward sweep-line algorithm:
******* the left and right children of a node correspond to adjacent slabs L and R.
******* If L contains the left corners of a rectangle that spans all of R,
******** we must add this information during the merge:
******** maintain C_L = the number of active rectangles with left corners in L which span R.
******** C_R is defined symmetrically.
******* When we encounter a point (not a corner), we add C_(other side) to its counter.
****** Possibly confusing: we sort and divide and conquer on x, but the merge inside the mergesort is done on y.
***** Now we know the output size.
** Online cache-oblivious 2D orthogonal range searching
*** Give me the points in the rectangle.
*** Goal: O(log_B(N) + output/B) memory transfers per query
**** for regular, 4-sided range search.
*** Space:
**** 2-sided: O(N)
**** 3-sided: O(N log(N))
**** 4-sided: O(N log^2(N)/log(log(N)))
*** Slower than batched, but online.
*** Space is the hard part:
| Sides | RAM | Cache-oblivious |
| 2 | O(N) | O(N) |
| 3 | O(N) | O(N log(N)) |
| 4 | O(N log(N)/log(log(N))) | O(N log^2(N)/log(log(N))) |
*** 2-sided
**** Query: points with x and y at most given values, i.e. a quarter plane.
**** vEB-layout BST on the y coordinates of the points.
**** Follow the leaf into an array (sorted by neither x nor y); walk right until you find an x that is too big.
***** We want all points with x smaller than the given x limit.
**** Claim: the array has linear size,
***** and the number of items traversed in the array is O(output).
***** Insane performance, very cool.
**** Claims:
***** the algorithm finds the right answer;
***** the number of scanned points is O(output);
***** the array has size O(N).
**** Density: a query (<= x, <= y) is dense if the number of points in (<= x, *) is less than alpha * (number of points in the answer)
***** (the easy case);
***** otherwise the range is sparse.
**** First try at a proof:
***** S_0 = all points, sorted by x.
***** S_i = the points to the right after cut i.
***** If a query's y is very large, it is guaranteed to be dense.
***** Consider the first sparse query, i.e. (<= x, <= y) sparse in S_(i-1);
***** each query will be dense in one of the S_i,
***** so continue until |S_i| = O(1).
***** Won't work, because the total array must be O(N) and storing every S_i is too big.
**** Second try:
***** "Maximize common suffix"
****** The speaker wasn't sure what this phrase means.
****** It appeared in the posted lecture notes.
***** x_i = max x such that (<= x, <= y) is sparse in S_(i-1).
***** We know queries above us are dense, and the same for queries to the right of us.
***** P_(i-1) = S_(i-1) intersect (<= x_i, <= y_i)
***** S_i = S_(i-1) intersect ((*, <= y_i) union (> x_i, *))
***** We can't store the S_i's; they're too big.
***** We store the P_i's, plus the final S_i: we can get down to something of constant size before storing an S_i.
**** Analysis: claim the structure has linear size.
***** Space: |P_(i-1) intersect S_i| <= (1/alpha) * |P_(i-1)|
****** This is what the charging scheme uses to prove linear size.
****** The value between bars is the size of the answer for (x_i, y_i);
****** |P_(i-1)| is the number of points in (<= x_i, *).
***** Charge storing P_(i-1) to the points of P_(i-1) \ S_i:
****** these are the points being discarded from our region, which is good: if we charge them now, we'll never charge them again in the future;
****** and the charge is spread over many more elements than the query returns, so the cost per element is very small.
***** Therefore the array P_0, P_1, ... has linear size.
***** Since we scan the P_i's, not the S_i's, we have to look at some extra points.
***** Query cost: a decreasing geometric series
****** by the charging claim the |P_i| are decreasing, so we search in increasingly smaller collections.
**** Correctness:
***** the scan of each next P_i starts over from x = -infinity and proceeds forward,
***** and once you exceed the threshold, you can stop.
*** 3-sided
**** Binary tree; each node has 2 augmented structures:
***** one structure does 2-sided range queries, the other does inverted 2-sided range queries;
***** query via the least common ancestor of the endpoints.
**** Build it in van Emde Boas layout.
**** O(N log(N)) space; the same indirection trick as before avoids an extra log factor.
*** 4-sided
**** Could use exactly the same trick, losing another log factor in space,
**** but we want to do slightly better:
**** do the same thing as 3-sided, but on sqrt(log(N))-ary trees instead of a binary tree.
**** Imagine contracting the vEB tree into chunks, each with sqrt(log(N)) children.
**** Store a BST on y at those chunk nodes:
***** this lets us avoid querying the leaves between the two endpoint leaves,
***** using 3-sided range queries;
***** the BSTs on y have O(N) total size,
***** and the top-level tree has O(N log(N)) size, from the previous structure.
**** Since the height of the contracted tree is 2 log(N)/log(log(N)), we get that last little bit of space improvement: O(N log^2(N)/log(log(N))) total.
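The van Emde Boas layout that all three lectures lean on can be made concrete with a short sketch. Here the tree is a perfect BST stored with implicit heap indices (node v has children 2v and 2v+1); the function `veb` and the choice of giving the ceiling of h/2 levels to the top triangle are my own illustrative conventions.

```python
def veb(v, h):
    """Return the vEB-order list of heap indices for the height-h subtree at v."""
    if h == 1:
        return [v]
    bottom = h // 2                  # carve at (roughly) the middle level
    top = h - bottom
    order = veb(v, top)              # lay out the top triangle first...
    leaves = [v]
    for _ in range(top - 1):         # find the top triangle's bottom row
        leaves = [c for u in leaves for c in (2 * u, 2 * u + 1)]
    for u in leaves:                 # ...then each bottom triangle in turn
        order += veb(2 * u, bottom) + veb(2 * u + 1, bottom)
    return order

order = veb(1, 4)                    # 15-node perfect tree
print(order)                         # → [1, 2, 3, 4, 8, 9, 5, 10, 11, 6, 12, 13, 7, 14, 15]

# Positions of a root-to-leaf path cluster into few contiguous runs,
# which is why a search touches only O(log_B N) blocks.
pos = {v: i for i, v in enumerate(order)}
print([pos[v] for v in [1, 3, 7, 14]])   # → [0, 2, 12, 13]
```

Note how the path 1, 3, 7, 14 lands at positions 0, 2, 12, 13: two tight clusters, one per triangle at this level of detail, rather than the scattered positions a BFS layout would give.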
