Tries, Suffix Tree, and Suffix Array

Opeyemi M.
Sep 25, 2020 · 10 min read

Tries

First introduced in 1959 by René de la Briandais in his paper “File Searching Using Variable Length Keys”, the name trie was later coined by Edward Fredkin in 1961, from the word reTRIEval. Fredkin pronounced it “tree”, but most scholars prefer to pronounce it “try” to avoid confusion with the existing tree data structure, even though a trie is itself a tree-based data structure.

It is an ordered tree used to store associative arrays whose keys are strings. The structure is organized around string prefixes to support fast pattern matching.

Tries remain a very useful data structure today, appearing in string autocomplete, longest-prefix matching, command suggestion in IDEs, pattern matching in DNA sequence search, and much more.

Google autocomplete/suggestion feature.

Unlike nodes in a binary tree, nodes in a trie don’t store their keys directly. Rather, the position of a node in the tree determines the key associated with it.

Each leaf node in the tree indicates a particular string, and the path to it spells out all of that string’s prefixes. That means if two strings share a prefix, they traverse the same path until they reach the character where they branch apart.

A clear pictorial description is shown in the figure below.

Fig. 1: A pictorial description of a Trie representation.

In the figure above, the strings ATE and ATOM share the prefix A-T, so they follow the same path until the character O in ATOM, where ATOM branches off and creates its own path.

From the trie above, it is evident that traversing from the root node to a leaf node yields a particular string and all of its prefixes. This is very handy in dictionary implementations, as the characters in the trie are stored in lexicographical order.

From the trie in Fig. 1, we can easily see that the time required to search for a particular string is proportional to the size of the string and the size of the trie’s alphabet. The running time is therefore O(d*m), where d is the size of the alphabet and m is the size of the string.

This type of trie is referred to as the Standard Trie or simply Trie. Let us give a formal definition of a standard trie.

A standard trie X for a set of strings S is an ordered tree in which every node of X (except the root) is labeled with a character from the alphabet, the children of each internal node have distinct labels, and each leaf corresponds to a string of S spelled out by the path from the root.

An example is shown in the figure below, given the set of strings S = {bear, bell, bid, bull, buy, sell, stock, stop}.

Fig. 2: Standard Trie representation.
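The standard trie in Fig. 2 can be sketched with a few lines of Python. This is a minimal illustration using nested dicts as nodes; the names `trie_insert` and `trie_search` are my own, not from any library:

```python
# A minimal standard-trie sketch: each node is a dict mapping a character
# to its child node. Names and representation are illustrative.
END = "$"  # marker key denoting that a complete word ends at this node

def trie_insert(root, word):
    """Insert a word character by character, creating nodes as needed."""
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True  # mark the end of a complete word

def trie_search(root, word):
    """Return True if the exact word was inserted; takes O(len(word)) steps."""
    node = root
    for ch in word:
        if ch not in node:
            return False
        node = node[ch]
    return END in node

root = {}
for w in ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]:
    trie_insert(root, w)

print(trie_search(root, "bell"))  # True
print(trie_search(root, "bel"))   # False: "bel" is only a prefix
```

Note how the shared prefix "be" of bear and bell is stored only once, exactly as in the figure.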

Comparison with BST and HashMap

Performing a search operation in a binary search tree (BST) takes O(S*log(N)) time, where S is the length of the string to be searched and N is the number of keys. Performing the same search in a trie requires only O(S).

Searching in a HashMap takes O(S) time on average, but because of hash collisions a search can degrade to O(N) time. In such cases, a trie has the faster runtime when searching for strings (prefixes especially).

The trade-off is that a trie is not always space-efficient, because it stores every character as a node pointing to the next character of the string. However, when there is a large number of short keys, the shared prefix paths make a trie a space-efficient structure to implement.

In summary, a standard trie may have a faster runtime than some self-balancing BSTs or HashMaps, but it is space-inefficient. This drawback led to the introduction of a new variation of tries, known as the compressed trie.

Compressed Tries

To address the large space consumption of the standard trie, the compressed trie uses a simple but efficient method to handle redundant nodes. So what are redundant nodes? They are internal nodes that have a single child. In Fig. 2, while traversing through sell, the node “e” has one child node “l”, and that node “l” has another single child node “l”. A lot of space can be saved if such nodes are compressed, so that every chain of non-root, single-child nodes is merged into one node. Instead of labeling each node with an individual character, a compressed trie labels the merged node with the concatenated sub-string. Fig. 3 below illustrates what Fig. 2 looks like after compression.

Fig. 3: Compressed Trie

From Fig. 3 it is evident that each internal node is now constrained to have at least two children, thereby reducing the space required. This is stated formally in the theorem below.

A compressed trie T storing a collection S of s strings from an alphabet of size d has the following properties:

Every internal node of T has at least two children and at most d children

T has s external nodes

The number of nodes of T is O(s).
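The compression step itself can be sketched as a small function over the dict-based trie from earlier. This is an illustrative sketch, not a production implementation; `insert` and `compress` are hypothetical names:

```python
# Sketch: compress a dict-based trie by merging chains of single-child
# nodes into one edge labeled with the concatenated sub-string.
END = "$"  # end-of-word marker key

def insert(root, word):
    node = root
    for ch in word:
        node = node.setdefault(ch, {})
    node[END] = True

def compress(node):
    """Return a new trie in which every single-child chain is one edge."""
    out = {}
    for label, child in node.items():
        if label == END:
            out[label] = child  # keep the end-of-word marker as-is
            continue
        # absorb descendants while the child has exactly one outgoing edge
        while len(child) == 1 and END not in child:
            (nxt, child), = child.items()
            label += nxt
        out[label] = compress(child)
    return out

root = {}
for w in ["bear", "bell", "bid", "bull", "buy", "sell", "stock", "stop"]:
    insert(root, w)

ctrie = compress(root)
print(sorted(ctrie))       # ['b', 's']
print(sorted(ctrie["s"]))  # ['ell', 'to']
```

Under “s”, the chains e-l-l and t-o collapse into the single edges “ell” and “to”, matching the shape of Fig. 3.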

Suffix Trees

Before discussing the suffix tree in detail, let us establish what prefixes and suffixes are:

Given a string S = S1S2S3…Sn, the prefixes of S are the strings S1…Si, for i = 1…n.

Likewise, the suffixes of S are the strings Si…Sn, for i = 1…n.

A clear demonstration of this is shown with an example.

Given a string S = banana.

The prefixes of S are; b, ba, ban, bana, banan, banana.

While the suffixes are; a, na, ana, nana, anana, banana.
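These two lists can be generated directly with slicing, shortest first to match the order above:

```python
# Enumerate the prefixes and suffixes of "banana", as listed in the text.
S = "banana"
prefixes = [S[:i] for i in range(1, len(S) + 1)]
suffixes = [S[i:] for i in range(len(S) - 1, -1, -1)]  # shortest suffix first

print(prefixes)  # ['b', 'ba', 'ban', 'bana', 'banan', 'banana']
print(suffixes)  # ['a', 'na', 'ana', 'nana', 'anana', 'banana']
```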

Now let us discuss: what exactly is a suffix tree?

The suffix tree, also referred to as the PAT tree or position tree, is another variation of the trie. As its name suggests, this kind of trie cares about the suffixes of a given string. It is a static structure that preprocesses a large string S to allow fast matching of any sub-string of S. It stores the suffixes of the string as its keys and the positions of those suffixes in the string as its values. A proper definition of a suffix tree is given below.

A suffix tree of a string S of n length satisfies the following:

- The number of leaves in the tree is exactly the length n of the string.

- Since it is a compressed trie, all internal nodes must have at least two children.

- Edges are labeled with sub-string (non-empty) of S.

- Edges from the same node should have unique labels.

- Any path from the root to leaf should represent the suffixes of S.

Algorithm

1: Let X be the set of suffixes of the given string S. Append a $ (dollar sign) to each suffix to mark its end.

2: Sort the suffixes in lexicographical order.

3: For each group Xk (one group for every k ∈ Σ, where Σ is the alphabet):

(a) If Xk contains only a single element, create a leaf node for it.

(b) Otherwise, find the group’s Longest Common Prefix (LCP) and create an internal node for it.

(c) Repeat from Step 2, with X being the remaining suffixes from Step 3(b) after splitting off the LCP.
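The steps above can be sketched naively in code: insert every suffix of S into a dict-based trie (with $ marking each suffix's end), then compress single-child chains. Dict edges group suffixes by shared prefix, so this yields the same tree as the sort-and-LCP procedure. The representation and names are illustrative, not a standard API:

```python
# Naive suffix-tree sketch: a dict-based trie of all suffixes, compressed.
END = "$"  # terminator key; its value is the suffix's start position

def compress(node):
    """Merge single-child chains into one edge labeled with the sub-string."""
    out = {}
    for label, child in node.items():
        if label == END:          # leaf marker: value is a start position
            out[label] = child
            continue
        while len(child) == 1 and END not in child:
            (nxt, child), = child.items()
            label += nxt          # absorb the single-child chain
        out[label] = compress(child)
    return out

def suffix_tree(S):
    root = {}
    for i in range(len(S)):       # insert suffix S[i:], then its terminator
        node = root
        for ch in S[i:]:
            node = node.setdefault(ch, {})
        node[END] = i + 1         # 1-based start position stored at the leaf
    return compress(root)

tree = suffix_tree("tatat")
print(sorted(tree))    # ['at', 't']: the two suffix groups under the root
print(tree["t"][END])  # 5: the suffix "t$" starts at position 5
```

For S = tatat, the root has the two branches "at" and "t", each splitting again on "$" versus "at", exactly as the worked example below derives.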

Examples

Assume we are given a string S = tatat.

Firstly:

Let’s index each suffix to indicate its original position in the string.

Next:

In a lexicographical order (dictionary order), let us sort the suffixes from Step 1.

After successfully sorting the suffixes, we group them such that suffixes sharing a prefix belong to the same group.

Next:

Group S1 has only a single element, so we apply Step 3(a) of the algorithm and create a leaf node for it.

Next:

Groups S2 and S3 fit Step 3(b) of the algorithm; the table below shows their Longest Common Prefixes.

From the table above, we proceed with Step 3(b) of the algorithm: create an internal node for each of the two groups and label its edge with the group’s longest common prefix.

Next:

We should eliminate the LCP in S2 and S3.

Next:

We recursively perform Steps 2 and 3 in each group, starting with S2. After performing a lexicographical sort, we find two new subgroups: the first contains $ and the second contains at$. As both contain single elements, Step 3(a) is performed.

We recursively continue in this fashion until no elements are left in each group.

The complexity of Suffix Trees

Keep in mind that the algorithm we have used above is not the most efficient suffix tree construction algorithm. It works with the n distinct suffixes of the given string, where n is the length of the string: the largest suffix has length n, the next has n-1, then n-2, and so on down to 1.

This yields both time and space complexity of O(n²). However, more advanced techniques can be employed on suffix trees to bring them down to linear space O(n); one such technique is the compact representation.

Applications of suffix trees

Suffix trees are practically relevant in many application areas, some of which are:

a. Pattern matching

b. Palindrome matching

c. Longest Common sub-string

d. Longest Common Prefix

e. Longest Repeated sub-string

Suffix Arrays

The suffix array was introduced by Udi Manber and Gene Myers in 1990 to address the space inefficiency of suffix trees. It is a simple data structure and easy to implement. In simple terms, the suffix array of a given string S is a lexicographically sorted array of integers representing the positions of the suffixes of S.

Formally, let S = S[1]S[2]…S[n] be a string of n elements, and let S[i,j] denote the sub-string of S from the ith element to the jth. The suffix array A of S is simply an array of integers giving the starting positions of the suffixes of S in alphabetical/lexicographical order.

Example

Given our favorite string S = banana.

Firstly, let’s index the string in an array to know each position.

Next, we sort the suffixes in lexicographical order.

The suffix array indexing the start position of each suffix will be;

From the array, we can see that A[7] contains the value 3, which signifies the position at which that suffix begins in the original string b a n a n a $ (the third position).

Construction of Suffix Array

We can do this using several approaches of varying sophistication. Naively, we can simply sort all the suffixes with a comparison-based sort, which uses O(n log n) suffix comparisons; since each suffix comparison takes O(n) time, the overall construction requires O(n² log n).
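This naive method is a few lines in code. A sketch, assuming a $ terminator as in the earlier examples and 1-based positions; `suffix_array` is an illustrative name:

```python
# Naive suffix-array construction: sort the suffixes, record start positions.
# The sort performs O(n log n) comparisons of up to O(n) characters each,
# giving the O(n^2 * log n) bound discussed above.
def suffix_array(S):
    S = S + "$"  # terminator: in ASCII, "$" sorts before every letter
    # 1-based start positions, ordered by the suffix starting there
    return sorted(range(1, len(S) + 1), key=lambda i: S[i - 1:])

print(suffix_array("banana"))  # [7, 6, 4, 2, 1, 5, 3]
```

The result matches the banana example above, with A[7] = 3.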

But a suffix array can also be constructed easily from a suffix tree. As discussed earlier, a suffix tree can be constructed (using more advanced techniques) in linear time O(n), where n is the length of the string, and a simple depth-first traversal of the suffix tree then yields the suffix array. Let’s see the example below.

Example

Given a string S = cattcat over the alphabet (a, c, t).

Step 1:

Firstly, we construct the suffix tree for the string.

Step 2:

Perform a depth-first traversal on the suffix tree (scanning each node in lexicographical order).

The output of this will be: 8 6 2 5 1 7 4 3

NOTICE:

Red integers indicate the starting position of the suffix ending at that leaf.

Step 3:

Represent the leaves in an array, keeping the traversal order intact (i.e. 8 comes before 6). The array will automatically be sorted in the lexicographical order of the suffixes. The image below depicts this clearly.
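Steps 1–3 can be sketched as follows. For the traversal an uncompressed suffix trie is enough, so the tree-building here is naive; the representation and names (`LEAF`, `dfs`) are illustrative:

```python
# Build a naive suffix trie, then read off the suffix array by visiting
# edges in lexicographic order ("$" sorts before letters in ASCII).
LEAF = None  # sentinel key whose value is a leaf's 1-based start position

def suffix_tree(S):
    S = S + "$"  # unique terminator makes every suffix end at its own leaf
    root = {}
    for i in range(len(S)):
        node = root
        for ch in S[i:]:
            node = node.setdefault(ch, {})
        node[LEAF] = i + 1
    return root

def dfs(node, out):
    if LEAF in node:
        out.append(node[LEAF])  # reached a leaf: emit its start position
    for label in sorted(k for k in node if k is not LEAF):
        dfs(node[label], out)   # visit children in lexicographic order
    return out

print(dfs(suffix_tree("cattcat"), []))  # [8, 6, 2, 5, 1, 7, 4, 3]
```

The output matches the traversal order 8 6 2 5 1 7 4 3 shown in Step 2, and the resulting array is the suffix array of cattcat$.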

There are other algorithms, like the skew algorithm, that also achieve the construction of the suffix array in linear time.

Space comparison

In a suffix tree, the space required is proportional to the number of characters in the string: O(n), where n is the number of characters in the string.

For example, suppose an implementation requires approximately 20 bytes per character. If we have, say, 1 GB of data (about 10⁹ characters) to represent with a suffix tree, we will consume approximately 10⁹ × 20 bytes = 20 GB of space. In a suffix array, we only need to store the integers indexing the position of each suffix in the string, and an integer costs us 4 bytes of memory, so representing the same data with a suffix array requires approximately 10⁹ × 4 bytes = 4 GB instead of the 20 GB used by the suffix tree.
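A quick back-of-the-envelope check of these figures (the 20-bytes-per-character cost is the assumption from the text, not a fixed property of suffix trees):

```python
# Sanity-check the space comparison above.
chars = 10**9                    # ~1 GB of single-byte characters
tree_gb = chars * 20 // 10**9    # suffix tree: ~20 bytes per character (assumed)
array_gb = chars * 4 // 10**9    # suffix array: one 4-byte int per suffix
print(tree_gb, array_gb)         # 20 4
```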

Although the three data structures discussed here were discovered quite a while ago, they are still very relevant in many of the applications we use in our daily lives. I hope this article gives you a friendly walk-through of these data structures.
