## Thursday, October 22, 2015

### History of Memory Models

This is the last subject from MIT's advanced data structure course that I'll be covering. The history of memory model section covers the different models of execution that algorithm analyses have used, and the unique bounds that richer models offer. Cache obliviousness will play a key role in my future topics of research for this study.

* History of memory models ** Memory hierarchy in practice *** Many layers, different speeds *** Latency is big cost *** Throughput is actually high. *** Evaluating Models **** Blocking, caching, levels, and simplicity **** Want all *** Idealized two-level storage **** RAM = blocks of <= B bytes **** One block operation ***** Read up to B items from two blocks ***** Write to third block k **** Ignore item order within block ***** CPU operations considered free **** Items are indivisible. **** Permutation lower bound ***** Permuting N items to N/B full specified blocks needs theta((N/B) operations to touch them * log(B)) ****** In average case ***** Assuming N/B > B ****** Tall disk ***** Simplified model ****** Move items instead of copy ***** Potential = sum of n_ij * log (n_ij) ***** Maximized target configuration of full blocks (n_ij = B): phi = N log(B) ***** Random configuration with N/B > B has E[n_ij] = O(1) => E[phi] = O(N) ***** Claim: block operation increases phi by <= B ****** Fact: If we merge two clusters, each (x+y) log(x+y) the potential could only have only gone up by (x+y) ***** Theorem: theta (N/B log(B)) ****** Tight for B=O(1) ***** Theorem: theta (N/B log(N/B)) ****** Similar to radix sort where key = target block index ****** Accidental claim: tight for all B (< N/B) ******* Not true ****** We will see tight for B > log(N/B) **** External memory & word ram ***** Algorithms work in word ram model *** Red-Blue pebble game **** A 2-level memory model with a cache and a main memory **** No blocks, simply items **** Pebble game ***** View computation as a DAG, convert input to outputs ***** Pebble = "in memory" ***** Moves: ****** Place pebble on a node if all predecessors have pebble ******* Pebble means in memory ****** Remove pebble from node, discard computation results ***** Goal: Place pebble on all output nodes ****** Minimize maximum number of pebbles over time ******* Represents minimum memory to run program ***** Proof: Any DAG can be executed O(n / log(N)) maximum pebbles ****** Hopcroft, Paul, Valiant - ACM 1977 ****** Corollary: ******* DTIME(t) subset of DSPACE(t / log(t)) **** Red-Blue pebble game ***** Red pebble = "in cache" ***** Blue pebble = "on disk" ***** Moves ****** Place red pebble on node if all predecessors have red pebble ****** Remove pebble from node ****** Write: Red pebble -> blue pebble ****** Read: Blue pebble -> red pebble ***** Goal: Blue inputs to blue outputs ****** Cache size given, not the target of minimization ****** <= M red pebbles at any given time ****** Minimize number of cache <-> disk I/Os ******* Represents memory transfers ***** Results ****** Many computation DAGs have different times ****** Found times with speedup over regular RAM analysis ******* ex: FFT with standard algorithm: ******** theta(N log_M (N)) transfers have speedup theta(log(M)) *** I/O model **** Red-blue has cache, idealized has blocks. ***** We want both. Combine to get IO model ***** Done by Aggarwal & Vitter, ICALP 1987 **** Disk + cache has B items each **** Cache has M/B blocks for cache of size M **** Moves ***** Can move disk block to cache block ***** Can compute in cache ***** Can write cache to disk **** Really good model, many results **** Results ***** Scanning: Visiting N elements in order costs O(1 + N/B) memory transfers ****** This is optimal if N is much less than B ****** More generally can run <= M/B parallel scans keeping 1 block per scan in cache ******* ex: Merge O(M/B) lists of total size N in O(1 + N/B) memory transfers ****** Does the B factor matter? ******* ex: Should I presort my linked list versus traversal? ******* With cache and disk, N memory transfers is 71 hours ******* With cache and disk, N/B memory transfers takes 32 seconds ***** Search ****** Finding an element x among N items requires O(log_(B+1) N) memory transfers ******* Not binary search, B-way search ****** Lower bound in comparison model ******* Each block reveals where x fits among B items ******* Learn <= log(B+1) bits per read ******* Need log(N+1) bits ****** Upper bound ******* B-tree ******* Insert and delete in O(log_(B+1) (n)) ***** Sorting and Permutation ****** Sorting bound: theta( (N/B) log_(M/B)(N/B)) ******* Always keep cache sorted (free) ******* Might as well presort each block ******* Upon reading a block, learn how B items fit among M items in cache ******* Learn log(M+B choose B) ~ B log(M/B) bits per step ******* Need to learn log(N)! ~ N log N bits ******* Know N log B bits from block presort ****** O(M/B)-way mergesort ******* Split into M/B equal-sized pieces and merge ******* T(N) = M/B T(N/(M/B)) + O(1 + N/B) ******* T(B) = O(1) ******** From cache ******* Recursion tree gives us N/B items at every level ******* We have N/B leaves ******* log_(M/B) (N/B) levels ****** Distribution sort ******* sqrt(M/B)-way quicksort ******* 1) Find sqrt(M/B) partition elements, roughly evenly spaced ******* 2) Partition array into sqrt(M/B) + 1 pieces ******** Scan O(N/B) memory transfers ******** Algorithm: Like bloom ********* 1) For first, second, ... interval of M items ********** Sort in O(M/B) memory transfers ********** Sample every (1/4) sqrt(M/B)-th items ********** Total sample = 4N / sqrt(M/B) items ********* 2) For i = 1, 2, ... sqrt(M/B) ********** Run linear-time selection to find sample element at i / sqrt(M/B) fraction ********** Cost: O((N/sqrt(M/B))/B) ********** Total: O(N/B) memory transfers ******* 3) Recurse ******** Same recurrence as mergesort ****** Permutation bound: theta(min {N, (N/B)log_(M/B)(N/B) }) ******* Know where things should go, just need to move them there ******* Either use sort or use naive RAM algorithm ******* Solves Floyd's two-level storage problem, M=3B ******* *** Random vs Sequential IO **** Sequential memory transfers are part of bulk read/write of theta(M) items **** Random memory transfers otherwise **** Sorting ***** 2-way mergesort achieves O((N/B) log (N/B)) sequential ****** Best you can do ***** o(N/B log_(M/B) (N/B)) random implies omega((N/B) log (N/B)) **** Same trade-off for suffix-tree construction ***** Suffix tree = sorting + scans *** Hierarchal Memory Model **** Non-uniform-cost RAM ***** Accessing memory location x costs f(x) = [log x] ***** Why f(x) = log(x)? ****** Mead and Conway 1980 had that time for RAM access ***** Upper & lower bounds: Aggarwal, Alpern, Chandra, Snir - STOC 1987 ***** What if f(x) isn't a log? ***** Assume f(2x) <= c f(x) for a constant c > 0 ****** Polynomially bounded ***** Write f(x) = Sum of weight_i * [x > x_i ?] ****** Weigthed sum of threshold functions ***** Uniform optimality ****** Consider one term f_m(x) = [x >= M] ****** Algorithm is uniformly optimal if optimal on HMM_(f_m(x)) for all M ****** Implies optimality for all f(x) ****** This is the red blue pebble game! ***** Achieve tight bounds with uniform optimality **** Implicit HMM / non-uniform ***** Instead of algorithm explicitly moving data, using any conservative replacement strategy to evict from cache ***** 2 * T_LRU(N, M) <= T_OPT(N, 2M) = O(T_OPT(N, M)) assuming f(2x) <= c f(x) ***** Not uniform because we need M! **** Uniform solution ***** For general f, slpit memory into hcunks at x where f(x) doubles ***** Do LRU on this structure, from n to n+1 ***** T_LRU(N) = O(T_OPT(N) + N * f(N)) ****** Like MTF **** HMM with Block Transfer ***** Accessing memory location at x costs f(x) ***** Copying memory interval from x - delta .. x to y - sigma ... y costs f(max{x, y}) + delta ****** Models memory pipelining, block transfer ****** Ignores block alignment, explicit levels ***** Times messier, they depend on F. *** MHH **** Multilevel version of external-memory model ***** Most realistic ***** M_i <-> M_(i+1) happen in blocks of size B_i(Subblocks of M_(i+1)), and take t_i time. ***** All levels can be actively transferring at once ***** Alpern, Carter, Feig, Selker - FOCS 1990 **** Uniform memory hierarchy ***** Fix aspect ration and block growth factor ***** Reduce it to two parameters and one function ***** t_i grows exponentially ***** Results: Hard to state when good or bad. *** Cache-Oblivious Model **** Analyize RAM algorithm (not knowing B or M) on external memory model ***** Must work well for all B and M ***** Uniformity **** Lose factor of 2 in M and number of transfers due to not being able to manually manage ***** Assume T(B, 2M) <= C T(B, M) **** Clean model **** Adapts to changing B (ie: disk tracks) and changing M(eg: competing processes) **** Adapts to multilevel memory hierarchy ***** Assuming inclusion **** Scanning ***** Ramachandran, Frigo, Leiserson, Prokop - FOCS 1999 ***** Visit N elements in order costs O(1 + (N/B)) memory transfers ***** More generically, can run O(1) parallel scans ****** Assume M >= c B for appropriate constant c > 0 ***** Eg: merge two lists in O(N/B) memory transfers **** Searching ***** Prokop - Meng 1999 ***** Want O(log_B(N)) time ***** 1) Imagine a binary search tree on items ****** Want search tree, can't do B-way. Don't know be. ***** 2) Recursively cut into first part and bottom parts ****** Works well! ****** sqrt(N) chunks of size sqrt(N) ****** At some level of recursion, you'll be looking at the level where > B and < B on either side of you ******* Look inside of this, this is 4 log_B (N) total transfers ***** Look at B-trees, O(log_B(N)) per operation **** Sorting ***** O( (N/B) log_(M/B) (N/B)) possible, assuming M >= theta(B ** (1+epsilon)) (tall cache) ****** Funnel sort ******* Mergesort analoge ****** Distribution sort ******* Prokop, FOCS 1999 ****** Impossible without tall-cache assumptions ******* STOC 2003