Thursday, October 22, 2015

Succinct Suffix Arrays and Trees

The second part of the succinct section follows the previous succint/string section. This focuses on the problem of succinctly representing suffix arrays, an important optimization to keep memory pressure down for string algorithms.
* Succinct II: Succinct Suffix Arrays & Trees
** Intro
*** Compact & Succinct suffix arrays / trees
*** O(T loglogT)-bit suffix array
*** O(T)-bit suffix array
*** Succint suffix array -> tree
*** Big field to itself, many complicated
** Previous Works: Compact suffix arrays and trees
*** Grossi & Vitter: STOC 2000, SICOMP 2005
**** STOC 2000
**** O((P / log_sigma (T)) + |output| + ((log_sigma) ** epsilon)(T)) query
**** (1 + 1 / epsilon) |T| log |sigma| + 2 |T| + O(T / loglog (T)) bits
***** Note: |T| log |sigma| is what it takes to write down, optimal
**** In lecture will achieve same space, looser query bounds
*** Ferragina & Marzini
**** FOCS 2000 ; JACM 2005
**** 5 * H_K(T)  * |T| + O((T * ((sigma + log(T)) / loglog(T) + (T ** elpsilon) sigma ** (sigma + 1))
***** Bad for big sigma
***** Good for small
***** H_K(T) = k-th order empirical entropy
****** 0-th order entropy = H_0(string S) = sum of probability of char * log(1/prob) for all chars in sigma
******* Space = H_0(s) * |s|
****** H_K(T) = sum of all strings of length k, iteration string w of ((number of occurrences of w / length of T) * H_T(string of successor of w))
****** Independent of K, a constant
****** Compresses very well, used in bzip
****** Optimal if probability of a code word can depend on prefix.
**** O(|P| + |output| * log ** sigma(T)) query
*** Sadakane 2003
**** Great for large alphabets
** Succinct Data Structures for suffix arrays / trees
*** Grossi, Gupta, Vitter: H_K * |T| + O(T log(sigma) * (loglog(T) / log(T)))
**** H_K(T) can't be too small, else dominates. Can't compress a huge amount
**** O(P log(sigma) + ((log ** 2 (T)) / loglog(T)) log(sigma)) query
*** Ferragina, Manzini, Mikinen, Navarro:
**** 1 * H_K(T) * |T| + O(T / log ** (epsilon) ** (N))
**** O(P + |output| * log ** (1+epsilon) (T)) query + O(P) counting query
** Low-space construction
*** Son, Sadakane, Sung - FOCS 2003
** Grossi & Vitter:
*** Follow up to space bound, query bound will be weaker
*** Compressed suffix array
**** SA[k]
**** start:
***** Text T_0 = T
***** size n_0 = n = |T|
***** suffix array SA_0 = SA
***** SA[i] = index in T of i-th suffix of T in sorted order
**** step
***** T_(K+1) = <T_K[2i], T_K[2i+1]> for i=0, 1, ... n/2
***** n_(K+1) = n_(k/2) = n / (2 ** k)
**** SA_(K+1) = (1/2) * extract even entries from SA_(K+1)
***** 1) even-succ_k(i) =
****** i if SA_k[i] is even
****** j if SA_k[i] = SA[j]-1 is odd
******* Want to know the *rank* of the suffix, not index
***** 2) even-rank_k(i)
****** Number of even suffixes preceding the i-th suffix
****** Number of even values of SA_k[:i]
***** 3) SA_k[i] = 2 * SA_(K+1)[even-rank_k(even-succ_k(i)))] - (1 - is-even_k(i))
****** Stop recursion at level l = loglog(n) => n_l = n / log(n)
******* Can afford pointer encoding since n_l log n_l <= n bits
****** O(loglog(N)) query to SA[i], given O(1)-time is-even-suffix & even-succ & even-rank
****** is_even_suffix_k(1 if SA_k[i] is even, 0 otherwise)
****** even-rank_k = rank_1 in is_even_suffix
******* O((n_k) (loglog(n_k))/(log(n_k)))
***** even_succ_k is only nontrivial
****** trivial for SA_k[n] if k is even suffix, return n
******* Must check if even suffix in bit vector
****** SA_k[n], k odd
******* Return odd-rank_k(i) = i - even-rank_k(i)
****** Cutting out evens cut out (nk)/2 values
****** Order by i is
******* Order by odd suffix T_k[SA_k[i]:]
******* Order by  (T_k[SA_k[i]], even-succ_k(i))
******** We want to store the second thing
******** We can store the pairs in order by value
******** Assume sigma is a binary alphabet
**** Total time:
***** Trying to store sorted array of n_k / 2 values, each of (2 ** k) + log(N_k) bits
****** Actually store explicitly? Nope.
***** Store as follows:
****** Store leading log(n_k) bits of each value v_i via unary differential encoding
****** Uniary diff means use unary to represent diff between each value and predecessor
****** 0 ** (lead(v_1)) | 0 ** (lead(v_2) - lead(v_1)) | .......
****** Number of times I increment is at most omega(n_k). <= n_k 0's tight bound.
****** Ends up being <= (3/2) n_k
****** Store trailing 2 ** k bits explicitly.
****** Space = (2 ** k) * (n_k / 2) = n/2 bits
****** Total sum = (1/2)nloglog(N) + 5n + O(time to query)
***** Query on structure
****** Look up into sorted array at given index
****** Look up is easy with rank and select
****** Index i is i-th one bit is select_1(i)
****** Rank_0(select_1(i)) is the leading bits of v_i
****** Only thing we need to store is rank and select structure
****** Time to query = O(n / (loglog(n)))
***** We have a O(T (loglog(T)))-bit suffix array
** Compact Suffix Array
*** Only store 1/(epsilon) + 1 levels
**** 0-th, epsilon*L, 2 * epsilon * L, .... up to loglog(N)
*** "Even" is now "divisible by 2 ** (epsilon ** l)" = in SA_((k+1)epsilon * l)
**** 1) You find this by calling succ_k(i) repeatedly -> i' until is_even
***** Potentially epsilon_l steps, potentially 2 ** (epsilon*l) steps
***** Succ = go to next suffix
**** 2) Recurse SA(k+1)(epsilon * l) [even_rank_k(j)]
**** 3) Multiply by 2 ** (epsilon * l), subtract by number of steps in 1)
*** O(log ** (epsilon-prime) n)-query to SA[i]
*** Space is (1/epsilon + 6)n + O(n / loglog(N)) bits
** Optimizations
*** Note that at level 0, the successor structure is free
*** Store is_even_k as a succint dictionary(with rank)
**** Brodnik and Munro of L17
** Suffix array->tree
*** Store compressed binary trie on 2n + O(1) => 4n + o(n) bits
**** Balanced paren structure
*** In a search, we need to know the length of an edge. Need to know if we should branch/jump off
**** Look at subtrees and leftmost/rightmost leaf, look at interior-most leaves
***** Find leaves by finding longest match between left-most(y) and rightmost(y)
**** O(P * cost(Suffix array access))
** Succint suffix tree
*** Use suffix tree aboxe every bth suffix
*** Then the bottom chunks are just big O(N/B) bits
*** Use tiny lookup table, will be fine
*** Can be linear sweep
**** Look for all b of patterns at once in blocks,  you end up with a liner scan.
**** Will be absorbed O(P+b) time.

No comments:

Post a Comment