Variable Length Markov Chain (VLMC) models provide parsimonious high order Markov chains which can have a finite but long memory without suffering from the computational and estimation problems associated to dense high order Markov chain. This is achieved using the notion of context.
We consider a time series x = (xi)i ≥ 1
with values in a finite set S,
its state space. A context c
is a finite sequence of elements of S, c = (ck, …, c1)
observed in x. c is observed in x if there is t such that xt − 1 = c1, xt − 2 = c2, …, xt − k = ck.
Notice that c is numbered in a
reverse order, in the sense that c1 is the most recent
value in x while ck is the oldest
one. Thus the sub-sequence observed in x. In numerous papers, for instance
in the works of Bühlmann and Wyner, the convention is to reverse the
temporal order and to write c = (c1, …, ck),
keeping c1 as the
most recent value and ck as the oldest
one. In mixvlmc
we use the more natural convention of
writing contexts in temporal order, but many functions offer a
reverse
parameter than can be used to switch to Bühlmann
and Wyner convention.
Back to examples, if S = {0, 1} and x = (0, 0, 0, 1, 1, 1)
The contexts of a time series can be represented by a tree. The root of the tree stands for the empty context. The children of the root represent the contexts of length 1. In general, if a node represents the context c = (ck, …, c1), then contexts of the form c′ = (ck + 1, ck, …, c1) are represented by the children of the node. Descending in the context tree corresponds to adding to the past of the context.
Let us consider again x = (0, 0, 0, 1, 1, 1) and all contexts that appear at least twice in x (i.e. which are observed for at least two different values of t). An ASCII art representation of the corresponding context tree is (in pure ASCII):
*
+-- 0
| '-- 0
'-- 1
The tree represents 2 size one contexts ((0) and (1)), the direct children of the root (shown
as a star *
). It represents in addition 1 size 2 contexts,
(0, 0). Notice that for instance, (1, 0) is not a context in the tree as the
node of context (0) has only one child
labelled by 0.
Mixvlmc can be used to compute all the contexts of a time series
using the ctx_tree()
function/constructor as follows:
x <- c(0, 0, 0, 1, 1, 1)
library(mixvlmc)
x_ctx <- ctx_tree(x)
x_ctx
#> Context tree on 0, 1
#> Number of contexts: 3
#> Maximum context length: 2
The result of ctx_tree()
is a ctx_tree
object. It can be drawn using ascii art
The default extraction is done with min_size=2
and
max_depth=10
which means that
Notice that the number of potential contexts grows exponentially with
the length of the time series and it is therefore advisable to keep
max_depth
to a reasonable value. Let us consider a simple
example.
set.seed(0)
y <- sample(c("a", "b", "c"), 100, replace = TRUE)
y_ctx_def <- ctx_tree(y)
y_ctx_def
#> Context tree on a, b, c
#> Number of contexts: 77
#> Maximum context length: 6
With the default parameters, we end up with already 77 contexts.
Setting min_size=1
gives an unreasonable number of
contexts:
y_ctx_min_1 <- ctx_tree(y, min_size = 1)
y_ctx_min_1
#> Context tree on a, b, c
#> Number of contexts: 4607
#> Maximum context length: 99
Even if we decrease the depth limit the number of contexts remains very large:
y_ctx_min_1_d_15 <- ctx_tree(y, min_size = 1, max_depth = 15)
y_ctx_min_1_d_15
#> Context tree on a, b, c
#> Number of contexts: 1037
#> Maximum context length: 15
Contexts can be extracted from a context tree using the
contexts()
function as follows:
In general, the raw list of contexts is not directly useful and one
should use the node manipulation functions to leverage it (see below). A
simple approach consists in asking to contexts()
a
data.frame
output that will contain additional information
about the contexts. This is done implicitly when additional parameters
are given to contexts()
. In the simple case of
ctx_tree
, setting the frequency
parameter to
"total"
or "detailed"
gives access to the
distribution of xt for all the
t at which a context
appears.
With frequency = "total"
, we obtain a data frame with a
column freq
that contains the number of occurrences of each
context.
contexts(x_ctx, frequency = "detailed")
#> context freq 0 1
#> 1 0, 0 2 1 1
#> 2 0 3 2 1
#> 3 1 2 0 2
With frequency = "detailed"
, we obtain in
addition a column for each value in the state space S which contains the distribution of
xt for the
occurrences of each context. For instance in the table above, the
context (0, 0) appears twice in x and is followed once by 0 and once by 1.
Another way to extract information from a context tree, especially
large ones, it to operate at the node level, using the
find_sequence()
function (or the contexts()
function). For instance exploring the 4607 contexts obtained above with
min_size=1
is not convenient, but we may be interested by
e.g. the sequence c("a", "a", "a")
. We look for a
corresponding node in the tree with
node_aaa <- find_sequence(y_ctx_min_1, c("a", "a", "a"))
node_aaa
#> Sequence [T]: a, a, a
#> followed by a (3), b (2), c (2)
As the result is not NULL
, we know that the sequence
appears in the original time series. It is not a context, as shown
by
In this particular case, it is likely to be caused by longer contexts
for which c("a", "a", "a")
is a suffix. This can be
verified by looking at the children of node_aaa
:
children(node_aaa)
#> $a
#> Context [T]: a, a, a, a
#> followed by a (0), b (1), c (2)
#>
#> $b
#> Context [T]: b, a, a, a
#> followed by a (1), b (0), c (0)
#>
#> $c
#> Context [T]: c, a, a, a
#> followed by a (2), b (1), c (0)
This shows that we have indeed three contexts that all end by
c("a", "a", "a")
.
Nodes carry information about the sequences or contexts they
represent. The number of occurrences of a sequence in the original time
series is obtained with the counts()
function as
follows:
Notice that as with contexts()
those occurrences cover
only positions where the sequence is followed by at least one value. The
distribution of those values is given by another call to
counts()
:
If sequence positions were saved during the construction of the
context tree, the ctx_node
can report them using the
positions()
function:
See the documentation of the function for the definition of a position.