The second part of the succinct section follows the previous succint/string section. This focuses on the problem of succinctly representing suffix arrays, an important optimization to keep memory pressure down for string algorithms.

* Succinct II: Succinct Suffix Arrays & Trees

** Intro *** Compact & Succinct suffix arrays / trees *** O(T loglogT)-bit suffix array *** O(T)-bit suffix array *** Succint suffix array -> tree *** Big field to itself, many complicated ** Previous Works: Compact suffix arrays and trees *** Grossi & Vitter: STOC 2000, SICOMP 2005 **** STOC 2000 **** O((P / log_sigma (T)) + |output| + ((log_sigma) ** epsilon)(T)) query **** (1 + 1 / epsilon) |T| log |sigma| + 2 |T| + O(T / loglog (T)) bits ***** Note: |T| log |sigma| is what it takes to write down, optimal **** In lecture will achieve same space, looser query bounds *** Ferragina & Marzini **** FOCS 2000 ; JACM 2005 **** 5 * H_K(T) * |T| + O((T * ((sigma + log(T)) / loglog(T) + (T ** elpsilon) sigma ** (sigma + 1)) ***** Bad for big sigma ***** Good for small ***** H_K(T) = k-th order empirical entropy ****** 0-th order entropy = H_0(string S) = sum of probability of char * log(1/prob) for all chars in sigma ******* Space = H_0(s) * |s| ****** H_K(T) = sum of all strings of length k, iteration string w of ((number of occurrences of w / length of T) * H_T(string of successor of w)) ****** Independent of K, a constant ****** Compresses very well, used in bzip ****** Optimal if probability of a code word can depend on prefix. **** O(|P| + |output| * log ** sigma(T)) query *** Sadakane 2003 **** Great for large alphabets ** Succinct Data Structures for suffix arrays / trees *** Grossi, Gupta, Vitter: H_K * |T| + O(T log(sigma) * (loglog(T) / log(T))) **** H_K(T) can't be too small, else dominates. Can't compress a huge amount **** O(P log(sigma) + ((log ** 2 (T)) / loglog(T)) log(sigma)) query *** Ferragina, Manzini, Mikinen, Navarro: **** 1 * H_K(T) * |T| + O(T / log ** (epsilon) ** (N)) **** O(P + |output| * log ** (1+epsilon) (T)) query + O(P) counting query ** Low-space construction *** Son, Sadakane, Sung - FOCS 2003 ** Grossi & Vitter: *** Follow up to space bound, query bound will be weaker *** Compressed suffix array **** SA[k] **** start: ***** Text T_0 = T ***** size n_0 = n = |T| ***** suffix array SA_0 = SA ***** SA[i] = index in T of i-th suffix of T in sorted order **** step ***** T_(K+1) = <T_K[2i], T_K[2i+1]> for i=0, 1, ... n/2 ***** n_(K+1) = n_(k/2) = n / (2 ** k) **** SA_(K+1) = (1/2) * extract even entries from SA_(K+1) ***** 1) even-succ_k(i) = ****** i if SA_k[i] is even ****** j if SA_k[i] = SA[j]-1 is odd ******* Want to know the *rank* of the suffix, not index ***** 2) even-rank_k(i) ****** Number of even suffixes preceding the i-th suffix ****** Number of even values of SA_k[:i] ***** 3) SA_k[i] = 2 * SA_(K+1)[even-rank_k(even-succ_k(i)))] - (1 - is-even_k(i)) ****** Stop recursion at level l = loglog(n) => n_l = n / log(n) ******* Can afford pointer encoding since n_l log n_l <= n bits ****** O(loglog(N)) query to SA[i], given O(1)-time is-even-suffix & even-succ & even-rank ****** is_even_suffix_k(1 if SA_k[i] is even, 0 otherwise) ****** even-rank_k = rank_1 in is_even_suffix ******* O((n_k) (loglog(n_k))/(log(n_k))) ***** even_succ_k is only nontrivial ****** trivial for SA_k[n] if k is even suffix, return n ******* Must check if even suffix in bit vector ****** SA_k[n], k odd ******* Return odd-rank_k(i) = i - even-rank_k(i) ****** Cutting out evens cut out (nk)/2 values ****** Order by i is ******* Order by odd suffix T_k[SA_k[i]:] ******* Order by (T_k[SA_k[i]], even-succ_k(i)) ******** We want to store the second thing ******** We can store the pairs in order by value ******** Assume sigma is a binary alphabet **** Total time: ***** Trying to store sorted array of n_k / 2 values, each of (2 ** k) + log(N_k) bits ****** Actually store explicitly? Nope. ***** Store as follows: ****** Store leading log(n_k) bits of each value v_i via unary differential encoding ****** Uniary diff means use unary to represent diff between each value and predecessor ****** 0 ** (lead(v_1)) | 0 ** (lead(v_2) - lead(v_1)) | ....... ****** Number of times I increment is at most omega(n_k). <= n_k 0's tight bound. ****** Ends up being <= (3/2) n_k ****** Store trailing 2 ** k bits explicitly. ****** Space = (2 ** k) * (n_k / 2) = n/2 bits ****** Total sum = (1/2)nloglog(N) + 5n + O(time to query) ***** Query on structure ****** Look up into sorted array at given index ****** Look up is easy with rank and select ****** Index i is i-th one bit is select_1(i) ****** Rank_0(select_1(i)) is the leading bits of v_i ****** Only thing we need to store is rank and select structure ****** Time to query = O(n / (loglog(n))) ***** We have a O(T (loglog(T)))-bit suffix array ** Compact Suffix Array *** Only store 1/(epsilon) + 1 levels **** 0-th, epsilon*L, 2 * epsilon * L, .... up to loglog(N) *** "Even" is now "divisible by 2 ** (epsilon ** l)" = in SA_((k+1)epsilon * l) **** 1) You find this by calling succ_k(i) repeatedly -> i' until is_even ***** Potentially epsilon_l steps, potentially 2 ** (epsilon*l) steps ***** Succ = go to next suffix **** 2) Recurse SA(k+1)(epsilon * l) [even_rank_k(j)] **** 3) Multiply by 2 ** (epsilon * l), subtract by number of steps in 1) *** O(log ** (epsilon-prime) n)-query to SA[i] *** Space is (1/epsilon + 6)n + O(n / loglog(N)) bits ** Optimizations *** Note that at level 0, the successor structure is free *** Store is_even_k as a succint dictionary(with rank) **** Brodnik and Munro of L17 ** Suffix array->tree *** Store compressed binary trie on 2n + O(1) => 4n + o(n) bits **** Balanced paren structure *** In a search, we need to know the length of an edge. Need to know if we should branch/jump off **** Look at subtrees and leftmost/rightmost leaf, look at interior-most leaves ***** Find leaves by finding longest match between left-most(y) and rightmost(y) **** O(P * cost(Suffix array access)) ** Succint suffix tree *** Use suffix tree aboxe every bth suffix *** Then the bottom chunks are just big O(N/B) bits *** Use tiny lookup table, will be fine *** Can be linear sweep **** Look for all b of patterns at once in blocks, you end up with a liner scan. **** Will be absorbed O(P+b) time.

## No comments:

## Post a Comment