---
title: "Context trees"
output: 
  rmarkdown::html_vignette:
     df_print: kable
vignette: >
  %\VignetteIndexEntry{Context trees}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

Variable Length Markov Chain (VLMC) models provide parsimonious high order Markov chains which can have a finite but long memory without suffering from the computational and estimation problems associated to dense high order Markov chain. This is achieved using the notion of context.

## Contexts
### Definition

We consider a time series $x=(x_i)_{i\geq 1}$ with values in a finite set $S$, its state space. A context $c$ is a finite sequence of elements of $S$, $c=(c_k, \ldots, c_1)$ _observed_ in $x$. $c$ is _observed_ in $x$ if there is $t$ such that
$$
x_{t-1}=c_1, x_{t-2}=c_2, \ldots, x_{t-k}=c_k.
$$
Notice that $c$ is numbered in a _reverse_ order, in the sense that $c_1$ is
the most recent value in $x$ while $c_k$ is the oldest one. Thus the
sub-sequence observed in $x$. In numerous papers, for instance in the works 
of Bühlmann and Wyner, the convention is to reverse the temporal order and to 
write $c=(c_1, \ldots, c_k)$, keeping $c_1$ as the most recent value and $c_k$
as the oldest one. In `mixvlmc` we use the more natural convention of writing
contexts in temporal order, but many functions offer a `reverse` parameter 
than can be used to switch to Bühlmann and Wyner convention. 

Back to examples, if $S=\{0, 1\}$ and $x=(0, 0, 0, 1, 1, 1)$

- $(0, 0)$ is a context of $x$ with $t=3$ and $t=4$;
- $(0, 1, 0)$ is not a context of $x$.

### Context trees

The contexts of a time series can be represented by a tree. The root of the tree stands for the empty context. The children of the root represent the
contexts of length 1. In general, if a node represents the context $c=(c_k,
\ldots, c_1)$, then contexts of the form $c'=(c_{k+1}, c_k, \ldots, c_1)$ are
represented by the children of the node. Descending in the context tree 
corresponds to adding to the past of the context. 

Let us consider again $x=(0, 0, 0, 1, 1, 1)$ and all contexts that appear _at least twice_ in $x$ (i.e. which are observed for at least two different values of $t$). An ASCII art representation of the corresponding context tree is (in pure ASCII):
```
*
+-- 0
|   '-- 0
'-- 1
```

The tree represents 2 size one contexts ($(0)$ and $(1)$), the direct children of the root (shown as a star `*`). It represents in addition 1 size 2 contexts, $(0, 0)$. Notice that for instance, $(1, 0)$ is not a context in the tree as the node of context $(0)$ has only one child labelled by $0$. 

## Extracting contexts from a time series in mixvlmc
Mixvlmc can be used to compute all the contexts of a time series using the `ctx_tree()` function/constructor as follows:
```{r}
x <- c(0, 0, 0, 1, 1, 1)
library(mixvlmc)
x_ctx <- ctx_tree(x)
x_ctx
```
The result of `ctx_tree()` is a `ctx_tree` object. It can be drawn using ascii art
```{r}
draw(x_ctx)
```

The default extraction is done with `min_size=2` and `max_depth=10` which means that

- contexts are included only if they appear at least twice in the time series;
- the maximum length of a context is 10 (the term _depth_ is used in reference to the tree representation).

Notice that the number of potential contexts grows exponentially  with the length of the time series and it is therefore advisable to keep `max_depth` to a reasonable value. Let us consider a simple example.

```{r}
set.seed(0)
y <- sample(c("a", "b", "c"), 100, replace = TRUE)
y_ctx_def <- ctx_tree(y)
y_ctx_def
```

With the default parameters, we end up with already `r context_number(y_ctx_def)` contexts. Setting `min_size=1` gives an unreasonable number of contexts:
```{r}
y_ctx_min_1 <- ctx_tree(y, min_size = 1)
y_ctx_min_1
```
Even if we decrease the depth limit the number of contexts remains very large:
```{r}
y_ctx_min_1_d_15 <- ctx_tree(y, min_size = 1, max_depth = 15)
y_ctx_min_1_d_15
```

Contexts can be extracted from a context tree using the `contexts()` function as follows:
```{r}
contexts(x_ctx)
```
In general, the raw list of contexts is not directly useful and one should use the 
node manipulation functions to leverage it (see below). A simple approach consists 
in asking to `contexts()` a `data.frame` output that will contain additional information
about the contexts. This is done implicitly when additional parameters are given to
`contexts()`. In the simple case of `ctx_tree`, setting the `frequency` 
parameter to `"total"` or `"detailed"` gives access to the distribution of $x_t$ for 
all the $t$ at which a context appears. 
```{r}
contexts(x_ctx, frequency = "total")
```
With `frequency = "total"`, we obtain a data frame with a column `freq` that contains the number of occurrences of each context. 
```{r}
contexts(x_ctx, frequency = "detailed")
```

With `frequency = "detailed"`, we obtain _in addition_ a column for each value in the state space $S$ which contains the distribution of $x_t$ for the occurrences of each context. For instance in the table above, the context $(0, 0)$ appears twice in $x$ and is followed once by $0$ and once by $1$.

## Direct manipulation of nodes
Another way to extract information from a context tree, especially large ones, it
to operate at the node level, using the `find_sequence()` function (or the 
`contexts()` function). For instance exploring the `r context_number(y_ctx_min_1)`
contexts obtained above with `min_size=1` is not convenient, but we may be interested
by e.g. the sequence `c("a", "a", "a")`. We look for a corresponding node in the tree
with
```{r}
node_aaa <- find_sequence(y_ctx_min_1, c("a", "a", "a"))
node_aaa
```

As the result is not `NULL`, we know that the sequence appears in the original 
time series. It is not a context, as shown by
```{r}
is_context(node_aaa)
```
In this particular case, it is likely to be caused by longer contexts for which 
`c("a", "a", "a")` is a suffix. This can be verified by looking at the children
of `node_aaa`:
```{r}
children(node_aaa)
```
This shows that we have indeed three contexts that all end by `c("a", "a", "a")`.

Nodes carry information about the sequences or contexts they represent. The number
of occurrences of a sequence in the original time series is obtained with the
`counts()` function as follows:
```{r}
counts(node_aaa, frequency = "total")
```
Notice that as with `contexts()` those occurrences cover only positions where the
sequence is followed by at least one value. The distribution of those values is
given by another call to `counts()`:
```{r}
counts(node_aaa, frequency = "detailed")
```
If sequence positions were saved during the construction of the context tree, the
`ctx_node` can report them using the `positions()` function:
```{r}
positions(node_aaa)
```
See the documentation of the function for the definition of a position.