Monday, October 5, 2015

Strings and Succinct Data Structures

This last week I covered the "strings" section of the MIT lecture and the first of the two "succinct" data structure lectures. 
The strings section presented many of the problems that have motivated the succinct data structures that are covered in this course. I found the succinct data structure section very interesting. There are a few key ideas which underly a lot of the succinct data structures. Looking briefly into some of the papers mentioned shows this to be a sub-field with a history of mental gymnastics.
I'm quite a fan of the balanced parenthesis representation, as it represents the shape of the tree without naming its interior nodes. This is interesting, as it allows for memoization and other optimizations to arise out of the exposed generality.

* Strings
** Intro
*** Suffix trees
*** Tries, trays, compressed tries
** String matching
*** Given two strings, a text T and a pattern P
**** Both strings over an alphabet Sigma
*** Find the occurrences of the pattern in the text as a substring
*** Not looking at edit distance
*** Looking at a static data structure version of this, preprocess T
*** Query is the pattern.
*** Want O(P) query time
**** need to look at query but don't want to need to look at text
*** Want O(T) space
** Warmup: Predecessor problem among strings T_1, T_2 to T_n
*** Don't want to compare strings, they might be long.
*** Want to build a Trie
*** Trie
**** Trie = Rooted tree with labelled child branches.
***** In this case labelled with letters of the alphabet
*** Practical problem, substring searching in text or binary
*** Solution
**** We'd like to represent the T_i strings as root to leaf paths
***** In this case that would be a sequence of letters or the STRING
***** We add a symbol, $, to represent the end of every string.
**** In-order traversal of leaves = sorted strings.
**** Takes O(P) time to walk down
**** Must then do predecessor in node
*** Trie representation
**** Array, O(P) query, space is O(T * SIGMA)
***** Sigma space per node, T is number of nodes in tree or sum of string lengths
**** Balanced BST, O(P log sigma) query, space is O(T)
***** Height is log(sigma) query at node is P
**** Hash table: O(P) query time, O(T) space
***** Doesn't support predecessor queries, for that need vEB
***** Works much like an array, but has linear space.
***** Array is direct mapped hash table
**** van Emde Boas / y-fast: O(P loglog sigma) query, O(T) space
**** Trays
***** Achieve O(P + log(Sigma)) query, in O(T) space
***** Will support predecessor
***** Name combines tries and arrays
**** Weight balanced BST
***** O(P + log(K)) query. O(T) space
***** The weight of the subtree is the number of descendent leaves.
***** Optimally balance the right and left subtrees, make their weights constant factors of one another
****** Split in the middle
****** Recurse
***** Think of it as an array
****** Root is the center, left and right at 1/4 and 3/4, this continues.
****** Claim: Every two edges followed either advances P letter or reduces the number of candidate strings to 2/3
******* Two edges happens when we have a super-big subchild, this becomes a grandchild of the root
******* Charge to O(P) or O(log(k)), so implies O(log(k)) search
**** Leaf trimming
***** Indirection
***** Cut below maximally deep nodes with right number of descendants
****** Number of leaves of top tree is at least O( |T| / |sigma|)
****** Sub-tree is at least size sigma
***** Could do it on the number of branching nodes, O( |T| / |sigma| )
***** Since we reduce the number of leaves, our weight balanced BST is suddenly alright
****** Do it at the children
***** If you don't care about order of children, use hash table
*** Application: String sorting
**** Repeatedly insert into tries
**** O(T + k log sigma) to sort all strings, much better than mergesort
**** This is linear.
** Compressed Tries
*** Already did cover in signature sort
*** We get rid of non-branching nodes => compressed trie
*** Ex: If we have a level that has an edge that goes from a to one that goes to one other key only
**** Make the first edge take both the first and second key, and get rid of the intermediate node
**** - a -> - $ ->  becomes  - a $ ->
**** Number of internal nodes is numbers of leaves, so O(k) nodes
**** String representation: O(T) nodes in trie suddenly gets better because it's now O(k)
** Suffix trees
*** A compressed trie of all |T| suffixes T[i:] of T with $ appended
*** Goal: O(P) query, O(T) space
*** The root has a sorting of all suffixes, split into prefix/suffixes where applicable
*** All have a leaf node that points to the index into the string.
*** All indices from 0 to |String| are represented in trie.
*** Applications
**** Search for P gives subtree whose leaves correspond to all occurrences of P
***** O(P) time via hash + vEB
****** Permutes the children of every node
***** O(P + log(sigma)) via trays, leaves sorted in O(T) space
**** List first k occurrences in O(k) more time
***** Every node points to leftmost descendent left
***** Leaves connected via linked list for predecessor selection
**** Depth corresponds to how long string is
**** Want deepest node for longest repeated substring
**** Linear time
**** Longest common prefix between two nodes
***** In constant time!
***** LCA is lowest common ancestor
**** All occurrences of T[i=j] =
***** Weighted level ancestor j-i of T[i:] leaf
***** Need to deal with the fact that our compressed trie has more than one letter
****** This has O(loglog T) time in best case
****** Because with very long chains, you have to solve predecessor problem
****** Can't use lookup tables, can't afford to enumerate
****** Want to use van Emde Boas to represent long "ladders"
****** Now n is loglog(T), so O(loglog(T))
****** Nearly constant time
**** Multiple documents
***** Concatenate documents with special dollar sign differences
***** You'll see special substrings in common
***** You'll see different endings for different documents.
***** Document retrieval problem: O(P + # documents matching)
****** Not number of occurrences
****** Solution uses range minimum queries
** Document Retrieval
*** Want to return minimum number of sources, just check if source matches
*** Each $i stores leaf number of previous $i
*** Note: All dollar signs are leaves
*** Each dollar sign is associated with a different document.
*** In interval [l, n] of leaves, we want the first occurrence of $i for each i.
*** Then only need to pay for distinct matching documents
*** = $si whose stores value < l
*** Range Minimum Query -> find positive m with smallest stored #
**** If stored # < l, output i and recurse on [l, m-1] and [m+1, n]
** Suffix Arrays
*** Write down all suffixes, sort them
*** Can't explicitly represent them, but can store indices
*** $ comes before characters.
**** ex: $, a$, ana$, anana$, banana$, nana$ is sorted lexically
**** Can't store because quadratic size
**** Can store where the suffix the leaf represents resides in the tree. 
***** Do it sorted.
***** Ex: 6 5 3 1 0 4 2 for the banana thing above
**** O(T) space
*** How to build tree from array?
**** Cartesian tree of LCP
***** Cartesian tree in previous lecture, Converts RMQ into LCA.
**** Suffix Array: Sort suffixes of T, just store indices
**** O(T + sort(sigma)) construction
***** 1) Sort sigma
****** First time pay sort of sigma
****** Radix sort
***** 2) Replace each letter by it's rank in sigma
***** 3) Form T_0 = T[3i+k],t T_1 =  T[3i+1+k], T_2 = T[3i+2+k] for integer k, for 3 k
***** 4) Recurse on <T_0, T_1>, is 2/3 n
***** 5) Radix sort of T_2's suffixes
****** T_2[i:] = T[3i+2i] = <T[3i+2], T[3i+3]>,
***** 6) Merge suffixes (T_0 & T_1) with suffixes T_2
****** We can compare by stripping letters off of the smaller suffixes
****** We know how to compare those smaller suffixes
* Succinct I: Tries, Rank + Select
** Intro
*** Succinct binary tries
**** Level order
**** Via balanced parenthesis
*** Succinct rank & select
*** Linear space not optimal
*** Goal: "Small space"
**** Implicit: Optimal + O(1) bits
***** Ex: heap is a dynamic implicit data structure
***** Ex: A sorted array is implicit, supports binary search
**** Succinct: Optimal + small-o(Optimal) bits
**** Compact: O(Optimal) bits
** Previous work
*** Implicit dynamic search tree
**** Franceschini & Grossi
**** O(log(n)) worst-caste time/insert delete predecessor
**** O(log_b(n)) cache oblivious
*** Succinct dictionary: 
**** log(u choose n) bits to represent universe + O( (n ((loglog(n)) ** 2)) / log(n) )
**** Static, no insert/delete
**** Best known
**** O(1) membership query, but less space
*** Succinct Binary Trie: Munro & Raman
**** How many binary tries on N numbers are there?
**** The N-th Catalan number
***** C_n = (2n choose n) (n + 1) which is approx (4 ** n) such tries
***** Log of that is 2n bits, we can achieve 2n + o(n) bits
***** O(1) left child, right child, parent, subtree size
**** Was motivated by holding text, binary suffix tree
***** Encyclopedia needed to fit on CD, but needed to have fast arbitrary lookup
***** See: Strings lecture, (not yet covered)
**** O(1) insert, delete leaf & subdivide edge
*** Succinct k-ary trie: Farzan & menro
**** Number of k-ary tries is (kn + 1 choose n) / (kn + 1)
**** O(1) child with label i, parent, subtree size
***** With label i = contains a subtree with key i or has i
*** Succinct permutations: Munro, Raman, Raman, Rao
**** Use log(n)! + o(n)
**** O(log(n) / (loglog(n))) to get pi ** k(x) forall k
**** (1 + epsilon)nlogn bits, O(1) time
*** Abelian groups: Farzan & Munro
**** O(log(n)) bits for group of order n or n=|elements of group|.
**** O(1) multiply, inverse, equality testing
*** Graphs: Farzan & Munro
*** Implicit n-bit ints, inc/dec in O(log(n)) bit reads & O(1) bit writes: Munro & Rahman
*** https://cs.uwaterloo.ca/~imunro/pubs.htm
** Succinct Binary Tries
*** Level order representation
**** For each node in level order
***** Write a 0 or 1 if whether it has a left child and a 0/1 for whether it has a right child
**** Note: a lack of child on an edge is represented by an "external node" or a nullpointer
**** This bit representation is whether each node is an internal or external node in heap/array interpretations
***** 1 = internal node, 0 = external node
***** 2n+1 bits
***** (1) to include root, prepended, so 2n+1
**** Constant time left child, right child, parent
***** Lemma: left & right children of i-th internal node are at positions 2i & 2i+1
***** Proof: By induction on i.
****** Consider nodes i and i-1.
****** They will only have external nodes between them. So their children will be consecutive.
****** Works at same or at different levels.
****** So their children will be next to each other, and tree grows by 2, so node i will have 2i and 2i+1.
*** Rank & Select
**** In bit string
**** rank_1(i) = # of ones at or before position i
**** select_1(i) = position of j-th 1 bit
**** left-child(i) = 2(rank_1(i))
**** right-child(i) = 2(rank_1(i)) + 1
**** parent(i) = select(floor(i / 2))
**** Word Ram data structures, Integer data structure time
***** Want constant time, o(n) space.
***** Rank:
****** 1) Use lookup table for bit strings of length x => O((2 ** x) x log(x)) bits
******* x = (1/2)log(N) bits => O(sqrt(n) log(n) loglog(n))
******* Use indirection! (Inner routing structure to chunks of information)
****** 2) Split into ((log ** 2) (n))-size  chunks
******* Takes log(n) bits to store rank, writing down could get expensive
******* Splitting gets us O(n / log(n)) bits
****** 3) Split each chunk into (1/2)log(n)-size subchunks
******* Why not subdivide earlier? Because that would require more bits to write down
******* Each subchunk stores cumulative rank within chunk
******* This caps the number of bits for each rank to loglog(n) bits
******* Total size: O((n / log(n)) loglog(n)) bits
****** 4) Rank = rank of chunk + rank of subchunk + rank of element in subchunk
******* Possible to achieve O(n / (log ** k(n))) space, Patrascu
***** Select
****** 1) Use lookup table for bitstrings of length <= (1/2)log(N) => O(sqrt(n) log(n) loglog(n))
****** 2) Store array of indices of every (log(n) loglog(n))-th 1-bit
******* Take quotient and remainder to find which subchunk and which chunk to access
****** 3) Within a group of log(n) loglog(n) 1-bits, say r bits:
******* If r >= (log(n) loglog(n)) ** 2
********* Then store array of indices of 1-bits in group because it's sparse
********* O( (n / (nloglog(n))) * (log(n)loglog(n)) * log(n)) bits
********** Cancels to O(n / (loglog(n)))-bits which is slightly o(n)
******* Else we've got a small group, that can be split similarly to rank
******** Need to split more
******** Call this a "reduced" bit string
****** 4) Repeat steps 2) and 3) on all reduced bitstrings and further reduce to (loglog(n)) ** O(1)
******* 2') Store relative index of every ((loglog(n)) ** 2)-th 1-bit => O( ( n / ((loglog(n)) ** 2) * loglog(n)) = O(n / loglog(n)) bits
******* 3') If a group of loglog(N) ** 2 1-bits has r bits and if r is at least loglog(N) ** 4, then store all answers as relative indices
******** Each index is O(loglog(N)) bits
******** So space is O(N / (loglog(n) ** 4) * (loglog(N) ** 2) * loglog(N))
******** O( N / loglog(N)) bits
******* Else you are further reduced: loglog(N) ** 4 bits
******** Use lookup table, we're done. Use 1)
**** Dynamic versions: O(log(n) / loglog(n)) time per operation
*** Balanced parenthesis representation
**** More powerful for things like suffix trees
**** More like a DFS of the trie, less level-order
**** Binary Trie is Same thing as a rooted order tree
***** Don't distinguish between right and left
***** So we move everything to right spines with single right spines of each node
****** Left child becomes child of right spine at node that it connects
**** Is isomorphic to a balanced parenthesis representation
****** Ordered tree is each with balanced parens
****** Open paren for start to visit a node, closed paren for finish visit.
****** Catalan many balanced parens
****** Open parens are zeroes, closed parens are ones
****** Doesn't store labels, just structures.
**** Mapping

"Corresponds to"
| Binary Trie | Rooted Ordered Tree        | Balanced Parens                                                                                           |
| node        | node                       | First parenthesis                                                                                         |
| left child  | first child                | Next character if open paren, if next is closed, no children                                              |
| right child | next sibling               | Character after matching closing parenthesis                                                              |
| parent      | previous sibling OR parent | If previous character is closed paren, previous sibling is matching open paren. If open paren, is parent. |

**** O(1) time, as before, but we get subtree size
***** Subtree size in binary trie =
***** size(node) + size(right siblings) in rooted ordered tree =
***** 1/2 distance from open to close paren for node in balanced parens
**** Number of leaves in subtree in binary trie =
***** Balanced parens: Rank of enclosing close paren - rank of current open paren of node being considered.

No comments:

Post a Comment