Burrows-Wheeler-Transformation

The bwt can be read off of the last column of all sorted text rotations of a given text $T$. Let $n\in\mathbb{N}$, then we define $s:\Sigma^n\to\Sigma^n$ as function that permutates it's input text. Specifically, it takes/pops the first character and appends it to the end. Now, for convenience, $s_i$ is just $s$ applied $i$ times. Let me illustrate what we essentially have to do:

\[ \begin{array}{rl} i & s\_i(T) \\ \hline 1 & \mathtt{mississippi\$} \\ 2 & \mathtt{ississippi\$m} \\ 3 & \mathtt{ssissippi\$mi} \\ 4 & \mathtt{sissippi\$mis} \\ 5 & \mathtt{issippi\$miss} \\ 6 & \mathtt{ssippi\$missi} \\ 7 & \mathtt{sippi\$missis} \\ 8 & \mathtt{ippi\$mississ} \\ 9 & \mathtt{ppi\$mississi} \\ 10 & \mathtt{pi\$mississip} \\ 11 & \mathtt{i\$mississipp} \\ 12 & \mathtt{\$mississippi} \\ \end{array}\overset{sort}{\Rightarrow} \begin{array}{rl} i & s\_i(T) \\ \hline 12 & \mathtt{\$mississipp{\color{green}i}}\\ 11 & \mathtt{i\$mississip{\color{green}p}}\\ 8 & \mathtt{ippi\$missis{\color{green}s}}\\ 5 & \mathtt{issippi\$mis{\color{green}s}}\\ 2 & \mathtt{ississippi\${\color{green}m}}\\ 1 & \mathtt{mississippi{\color{green}\$}}\\ 10 & \mathtt{pi\$mississi{\color{green}p}}\\ 9 & \mathtt{ppi\$mississ{\color{green}i}}\\ 7 & \mathtt{sippi\$missi{\color{green}s}}\\ 4 & \mathtt{sissippi\$mi{\color{green}s}}\\ 6 & \mathtt{ssippi\$miss{\color{green}i}}\\ 3 & \mathtt{ssissippi\$m{\color{green}i}}\\ \end{array} \]

Our bwt is then $\mathtt{ipssm\$pissii}$. Note that $\$$ is a special character that stands for the end of the string. A very interesting connection exists between suffix arrays and the bwt. To investigate this further, let's make clear what a suffix array is: We say $\mathtt{pos}$ is an array of positions $\left\{0,\dotsc,n-1\right\}$ which represent the beginning of a suffix. They are ordered lexicographically. So, in a similar fashion to the bwt, we can construct the suffix array:

\[ \begin{array}{rl} i & \operatorname{suff}_i(T) \\ \hline 0 & \mathtt{mississippi\$} \\ 1 & \mathtt{ississippi\$} \\ 2 & \mathtt{ssissippi\$} \\ 3 & \mathtt{sissippi\$} \\ 4 & \mathtt{issippi\$} \\ 5 & \mathtt{ssippi\$} \\ 6 & \mathtt{sippi\$} \\ 7 & \mathtt{ippi\$} \\ 8 & \mathtt{ppi\$} \\ 9 & \mathtt{pi\$} \\ 10 & \mathtt{i\$} \\ 11 & \mathtt{\$} \\ \end{array}\overset{sort}{\Rightarrow} \begin{array}{rl} \downarrow\mathtt{pos} & \operatorname{suff}_i(T) \\ \hline 11 & \mathtt{\$}\\ 10 & \mathtt{i\$}\\ 7 & \mathtt{ippi\$}\\ 4 & \mathtt{issippi\$}\\ 1 & \mathtt{ississippi\$}\\ 0 & \mathtt{mississippi\$}\\ 9 & \mathtt{pi\$}\\ 8 & \mathtt{ppi\$}\\ 6 & \mathtt{sippi\$}\\ 3 & \mathtt{sissippi\$}\\ 5 & \mathtt{ssippi\$}\\ 2 & \mathtt{ssissippi\$}\\ \end{array} \]

Now, a very interesting connection between the bwt and the suffix array hides behind a corner. Take the sorted suffixes and prepend the character right before it. So, if we have the suffix $\mathtt{ssippi}$, we transform it to $\mathtt{{\color{green}i}ssippi}$. A special case for the non-whole-suffix $\mathtt{mississippi\$}$: We just prepend the $\$$ to it. The final table will look like this: (left, bwt is right)

\[ \begin{array}{lr} \begin{array}{rl} \downarrow\mathtt{pos} & {\color{green}c}+\operatorname{suff}_i(T) \\ \hline 11 & \mathtt{{\color{green}i}\$}\\ 10 & \mathtt{{\color{green}p}i\$}\\ 7 & \mathtt{{\color{green}s}ippi\$}\\ 4 & \mathtt{{\color{green}s}issippi\$}\\ 1 & \mathtt{{\color{green}m}ississippi\$}\\ 0 & \mathtt{{\color{green}\$}mississippi\$}\\ 9 & \mathtt{{\color{green}p}pi\$}\\ 8 & \mathtt{{\color{green}i}ppi\$}\\ 6 & \mathtt{{\color{green}s}sippi\$}\\ 3 & \mathtt{{\color{green}s}sissippi\$}\\ 5 & \mathtt{{\color{green}i}ssippi\$}\\ 2 & \mathtt{{\color{green}i}ssissippi\$}\\ \end{array} & \begin{array}{rl} i & s_i(T) \\ \hline 12 & \mathtt{\$mississipp{\color{green}i}}\\ 11 & \mathtt{i\$mississip{\color{green}p}}\\ 8 & \mathtt{ippi\$missis{\color{green}s}}\\ 5 & \mathtt{issippi\$mis{\color{green}s}}\\ 2 & \mathtt{ississippi\${\color{green}m}}\\ 1 & \mathtt{mississippi{\color{green}\$}}\\ 10 & \mathtt{pi\$mississi{\color{green}p}}\\ 9 & \mathtt{ppi\$mississ{\color{green}i}}\\ 7 & \mathtt{sippi\$missi{\color{green}s}}\\ 4 & \mathtt{sissippi\$mi{\color{green}s}}\\ 6 & \mathtt{ssippi\$miss{\color{green}i}}\\ 3 & \mathtt{ssissippi\$m{\color{green}i}}\\ \end{array} \end{array} \]

Now it is very easy to verify that $bwt[i] = T[pos[i] - 1]$, ignoring boundary conditions.

Avatar
Matthis Kruse
Masters Student

My research interests include programming languages and compilers.

Next
Previous