Abstract

With the package mHMMbayes you can fit multilevel hidden Markov models. The multilevel hidden Markov model (HMM) is a generalization of the well-known hidden Markov model, tailored to accommodate (intense) longitudinal data of multiple individuals simultaneously. Using a multilevel framework, we allow for heterogeneity in the model parameters (transition probability matrix and conditional distributions), while estimating one overall HMM. The model has great potential for application in many fields, such as the social sciences and medicine. The model can be fitted to multivariate data with a categorical distribution, and can include individual level covariates (allowing for, e.g., group comparisons on model parameters). Parameters are estimated in a Bayesian framework, utilizing the forward-backward recursion within a hybrid Metropolis-within-Gibbs sampler. The package also includes a function to simulate data and a function to obtain the most likely hidden state sequence for each individual using the Viterbi algorithm.

Hidden Markov models (HMMs; Rabiner 1989) are a machine learning method that has been used for several decades in many different scientific fields to describe a sequence of observations. Examples include translating a fragment of spoken words into text (i.e., speech recognition; see, e.g., Rabiner 1989; Woodland and Povey 2002) and identifying the regions of DNA that encode genes (i.e., gene tagging; see, e.g., Krogh, Mian, and Haussler 1994; Henderson, Salzberg, and Fasman 1997; Burge and Karlin 1998). The development of this package is, however, motivated by the social sciences. Due to technological advancements, it becomes increasingly easy to collect long sequences of data on behavior. That is, we can monitor behavior as it unfolds in real time. An example hereof is the interaction between a therapist and a patient, where different types of nonverbal communication are registered every second for a period of 15 minutes. When applied to such behavioral data, HMMs can be used to extract latent behavioral states and to model the dynamics of behavior over time.

A fairly recent development in HMMs is the extension to multilevel HMMs (see, e.g., Altman 2007; Shirley et al. 2010; Rueda, Rueda, and Diaz-Uriarte 2013; Zhang and Berhane 2014; Haan-Rietdijk et al. 2017). Using the multilevel framework, we can model several sequences (e.g., sequences of different persons) simultaneously, while accommodating the heterogeneity between persons. As a result, we can quantify the amount of variation between persons in their dynamics of behavior, easily perform group comparisons on the model parameters, and investigate how model parameters change as a result of a covariate. For example, are the dynamics between a patient and a therapist different for patients with a good therapeutic outcome and patients with a less favorable therapeutic outcome?

With the package `mHMMbayes`, one can estimate these multilevel hidden Markov models. This tutorial starts out with a brief description of the HMM and the multilevel HMM. For a more elaborate and gentle introduction to HMMs, we refer to Zucchini, MacDonald, and Langrock (2016). Next, we show how to use the package `mHMMbayes` through an extensive example, also touching on the issues of determining the number of hidden states and checking model convergence. Information on the estimation methods and algorithms used in the package is given in the vignette *Estimation of the multilevel hidden Markov model*.

We illustrate the use of the package with the embedded example data `nonverbal`. The data contain the nonverbal communication of 10 patient-therapist couples, recorded for 15 minutes at a frequency of 1 observation per second (= 900 observations per couple). The dataset contains the following variables:

- `id`: id variable of the patient-therapist couple, to distinguish which observation belongs to which couple.
- `p_verbalizing`: verbalizing behavior of the patient, consisting of 1 = not verbalizing, 2 = verbalizing, 3 = back channeling.
- `p_looking`: looking behavior of the patient, consisting of 1 = not looking at therapist, 2 = looking at therapist.
- `t_verbalizing`: verbalizing behavior of the therapist, consisting of 1 = not verbalizing, 2 = verbalizing, 3 = back channeling.
- `t_looking`: looking behavior of the therapist, consisting of 1 = not looking at patient, 2 = looking at patient.

The top 6 rows of the dataset are provided below.
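As the data are embedded in the package, these rows can be inspected directly:

```
library(mHMMbayes)
head(nonverbal)
```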

When we plot the data of the first 5 minutes (= the first 300 observations) of the first couple, we get the following:
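The vignette does not prescribe specific plotting code; a minimal base R sketch that produces such a plot (converting the data to a data frame first, in case it is stored as a matrix) could look as follows:

```
# Plot the first 300 observations (5 minutes) of couple 1,
# one panel per observed variable, as step functions.
nv <- as.data.frame(nonverbal)
couple1 <- nv[nv$id == 1, ][1:300, ]
vars <- c("p_verbalizing", "p_looking", "t_verbalizing", "t_looking")
old_par <- par(mfrow = c(4, 1), mar = c(4, 4, 2, 1))
for (v in vars) {
  plot(1:300, couple1[[v]], type = "s", yaxt = "n",
       xlab = "Time (seconds)", ylab = "Category", main = v)
  axis(2, at = sort(unique(couple1[[v]])), las = 2)
}
par(old_par)
```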

We can, for example, observe that both the patient and the therapist are mainly looking at each other during the observed 5 minutes. During the first minute, the patient is primarily speaking. During the second minute, the therapist starts speaking, after which the patient takes over while the therapist is back channeling.

To fit a simple 2-state multilevel model with the function `mHMM`, one first has to specify some general model properties and starting values:

```
library(mHMMbayes)
# specifying general model properties:
m <- 2
n_dep <- 4
q_emiss <- c(3, 2, 3, 2)
# specifying starting values
start_TM <- diag(.8, m)
start_TM[lower.tri(start_TM) | upper.tri(start_TM)] <- .2
start_EM <- list(matrix(c(0.05, 0.90, 0.05,
                          0.90, 0.05, 0.05), byrow = TRUE,
                        nrow = m, ncol = q_emiss[1]), # vocalizing patient
                 matrix(c(0.1, 0.9,
                          0.1, 0.9), byrow = TRUE, nrow = m,
                        ncol = q_emiss[2]), # looking patient
                 matrix(c(0.90, 0.05, 0.05,
                          0.05, 0.90, 0.05), byrow = TRUE,
                        nrow = m, ncol = q_emiss[3]), # vocalizing therapist
                 matrix(c(0.1, 0.9,
                          0.1, 0.9), byrow = TRUE, nrow = m,
                        ncol = q_emiss[4])) # looking therapist
```

The first line of code loads the `mHMMbayes` package and the `nonverbal` data. Next, we specify the general model properties: the number of states used is set by `m <- 2`, the number of dependent variables in the dataset used to infer the hidden states is specified by `n_dep <- 4`, and the number of categorical outcomes for each of the dependent variables is specified by `q_emiss <- c(3, 2, 3, 2)`.

The subsequent lines of code specify starting values for both the transition probability matrix and the emission distribution(s), which are given to the model in the argument `start_val` (see the next piece of code). These starting values are used for the first run of the forward-backward algorithm. Although the hidden states cannot be observed, one often has an idea of probable compositions of the states. In the example, we expect that there is a state in which the patient mostly speaks while the therapist is silent, and a state in which the patient is silent while the therapist speaks. In addition, we expect that during both states, the therapist and the patient will mainly be looking at each other instead of looking away. One usually also has some (vague) idea on likely and unlikely switches between states, and on the size of the self-transition probabilities. In the example, we think a state will usually last quite some seconds, and thus expect rather high self-transition probabilities. All these ideas can be used to construct a set of sensible starting values. Using sensible starting values increases convergence speed, and often prevents a problem called 'label switching'. Hence, using random or uniform starting values is not recommended, and a default option to do so is not included in the package. Note that it is strongly advised to check model convergence and label switching. That is, one should check whether the algorithm reaches the same solution when a set of different (but often conceptually similar) starting values is used, and whether label switching is not a problem. See the section *Checking model convergence and label switching* for an example, and the vignette *Estimation of the multilevel hidden Markov model* for more information on the forward-backward algorithm and on the problem of label switching.

As the estimation proceeds within a Bayesian context, a (hyper-)prior distribution has to be defined for the group level parameters, i.e., the group level emission and transition probabilities. Default, non-informative priors are used unless specified otherwise by the user. Below, we present some information on this. For a more elaborate explanation of the (hyper-)prior distributions used and their parameters, see the vignette *Estimation of the multilevel hidden Markov model*.

First of all, note that the prior distributions on the emission distribution and transition probabilities are not placed on the probabilities directly, but on the intercepts (and regression coefficients, when covariates are used) of the multinomial regression model used to accommodate the multilevel framework of the data. Second, the parameters do not each have their own independent prior distribution. As each row of the emission distribution matrix and of the transition probability matrix sums to one, the individual parameters within these rows are connected. Hence, each row is seen as a set of parameters which are estimated jointly, and each set of parameters has a multivariate prior distribution.
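To make this concrete, under a baseline-category multinomial logit parametrization (our notation; the first category serves as the baseline, see below), the probabilities in row \(i\) of the transition probability matrix follow from the row's intercepts \(\alpha_{ik}\) as

\[\gamma_{ik} = \frac{\exp(\alpha_{ik})}{\sum_{l=1}^{m} \exp(\alpha_{il})}, \qquad \alpha_{i1} = 0,\]

and analogously for each row of the emission distribution(s). The prior distributions described below are placed on these intercepts.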

The sets of intercepts of the multinomial regression model follow a multivariate normal distribution. The (hyper-)prior for these intercepts thus consists of a prior distribution on the vector of means and a prior distribution on the covariance matrix. The hyper-prior for the mean intercepts is a multivariate normal distribution with, as default, a vector of means equal to 0, and a parameter \(K_0\) with which the covariance matrix is multiplied. Here, \(K_0\) denotes the number of observations (i.e., the number of hypothetical prior subjects) on which the prior mean vector of zeros is based. By default, \(K_0\) is set to 1. The hyper-prior for the covariance matrix between the sets of (state specific) intercepts is an Inverse Wishart distribution, for which the variance in the default setting equals 3 + \(m\) - 1 for the transition probabilities and 3 + \(q\) - 1 for the emission probabilities, and the covariance equals 0. The degrees of freedom of the Inverse Wishart distribution are set to 3 + \(m\) - 1 for the transition probabilities and 3 + \(q\) - 1 for the emission probabilities.

To specify user-specific prior distributions, one uses the input option `emiss_hyp_prior` for the emission distribution and `gamma_hyp_prior` for the transition probabilities in the function `mHMM`. Both objects are a list, containing the following elements:

- `mu0`: a list containing \(m\) matrices, one matrix for each row of the transition probability matrix gamma or of the emission probability matrix. Each matrix contains the hypothesized mean values of the intercepts. Hence, each matrix consists of one row (when not including covariates in the model) and \(m - 1\) or \(q - 1\) columns.
- `K0`: the number of hypothetical prior subjects on which this vector of means is based.
- `nu`: the degrees of freedom of the Inverse Wishart distribution.
- `V`: the hypothesized variance-covariance matrix between the set of intercepts.

Note that `K0`, `nu` and `V` are assumed equal over the states. The mean values of the intercepts (and the regression coefficients of the covariates) denoted by `mu0` are allowed to vary over the states. All elements in the list have either the prefix `gamma_` or `emiss_`, depending on which list they belong to. When specifying prior distributions, note that the first category of each row does not have an intercept, as it serves as the baseline category in the multinomial regression model. This means, for example, that if we were to specify a model with 3 states, each matrix in `mu0` is a vector with 2 elements, `K0` and `nu` contain 1 element, and `V` is a 2 by 2 matrix; see the sketch below.
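As a sketch, assuming the `gamma_` prefix convention described above (the exact element names should be verified against the package documentation), a user-specified hyper-prior on the transition probabilities of a hypothetical 3-state model that mirrors the default settings could be constructed as follows:

```
# Hypothetical 3-state example; values mirror the defaults described above.
m_3st <- 3
gamma_prior <- list(
  gamma_mu0 = rep(list(matrix(0, nrow = 1, ncol = m_3st - 1)), m_3st),
  gamma_K0  = 1,                              # hypothetical prior subjects
  gamma_nu  = 3 + m_3st - 1,                  # degrees of freedom
  gamma_V   = diag(3 + m_3st - 1, m_3st - 1)  # variance 5, covariance 0
)
```

Such an object would then be supplied to `mHMM` via the `gamma_hyp_prior` argument.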

The multilevel HMM is fitted using the function `mHMM`:

```
# Run a model without covariate(s) and default priors:
set.seed(14532)
out_2st <- mHMM(s_data = nonverbal,
                gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                start_val = c(list(start_TM), start_EM),
                mcmc = list(J = 1000, burn_in = 200))
```

The call to `mHMM` specifies the model with several arguments. The `s_data` argument specifies the input data used to infer the hidden states over time. The `gen` and `start_val` arguments specify the general model properties and the starting values, as discussed above. The arguments needed for the MCMC algorithm are given in `mcmc`: `J` specifies the number of iterations used by the hybrid Metropolis-within-Gibbs algorithm, and `burn_in` specifies the number of iterations to discard when obtaining the model parameter summary statistics. The function `mHMM` returns an object of class `mHMM`, which has `print` and `summary` methods to inspect the results. The `print` method provides basic information on the fitted model: the number of subjects in the dataset analyzed, the number of iterations and the burn-in period, the average log-likelihood and the average AIC over all subjects, the number of states specified, and the number of dependent variables the states are based on:

```
out_2st
#> Number of subjects: 10
#>
#> 1000 iterations used in the MCMC algorithm with a burn in of 200
#> Average Log likelihood over all subjects: -1624.389
#> Average AIC over all subjects: 3276.777
#>
#> Number of states used: 2
#>
#> Number of dependent variables used: 4
```

The `summary` method provides information on the estimated parameters. That is, the point estimates of the posterior distribution for the transition probability matrix and for the emission distribution of each of the dependent variables at the group level:

```
summary(out_2st)
#> State transition probability matrix
#> (at the group level):
#>
#> To state 1 To state 2
#> From state 1 0.929 0.071
#> From state 2 0.074 0.926
#>
#>
#> Emission distribution for each of the dependent variables
#> (at the group level):
#>
#> $p_vocalizing
#> Category 1 Category 2 Category 3
#> State 1 0.018 0.957 0.024
#> State 2 0.792 0.051 0.152
#>
#> $p_looking
#> Category 1 Category 2
#> State 1 0.248 0.752
#> State 2 0.096 0.904
#>
#> $t_vocalizing
#> Category 1 Category 2 Category 3
#> State 1 0.799 0.075 0.117
#> State 2 0.035 0.945 0.020
#>
#> $t_looking
#> Category 1 Category 2
#> State 1 0.047 0.953
#> State 2 0.277 0.723
```

The resulting model indicates 2 well-separated states: one in which the patient is speaking and one in which the therapist is speaking. Looking behavior is quite similar for both the patient and the therapist in the 2 states. Information on the estimated parameters can also be obtained with the functions `obtain_gamma` and `obtain_emiss`. These functions allow the user to inspect the estimated parameters not only at the group level, but also for each subject individually, by specifying the input variable `level = "subject"`:

```
# When not specified, level defaults to "group"
gamma_pop <- obtain_gamma(out_2st)
gamma_pop
#> To state 1 To state 2
#> From state 1 0.929 0.071
#> From state 2 0.074 0.926
# To obtain the subject specific parameter estimates:
gamma_subj <- obtain_gamma(out_2st, level = "subject")
gamma_subj
#> $`Subject 1`
#> To state 1 To state 2
#> From state 1 0.942 0.058
#> From state 2 0.048 0.952
#>
#> $`Subject 2`
#> To state 1 To state 2
#> From state 1 0.936 0.064
#> From state 2 0.060 0.940
#>
#> $`Subject 3`
#> To state 1 To state 2
#> From state 1 0.969 0.031
#> From state 2 0.054 0.946
#>
#> $`Subject 4`
#> To state 1 To state 2
#> From state 1 0.934 0.066
#> From state 2 0.046 0.954
#>
#> $`Subject 5`
#> To state 1 To state 2
#> From state 1 0.942 0.058
#> From state 2 0.058 0.942
#>
#> $`Subject 6`
#> To state 1 To state 2
#> From state 1 0.942 0.058
#> From state 2 0.087 0.913
#>
#> $`Subject 7`
#> To state 1 To state 2
#> From state 1 0.929 0.071
#> From state 2 0.042 0.958
#>
#> $`Subject 8`
#> To state 1 To state 2
#> From state 1 0.930 0.070
#> From state 2 0.081 0.919
#>
#> $`Subject 9`
#> To state 1 To state 2
#> From state 1 0.948 0.052
#> From state 2 0.058 0.942
#>
#> $`Subject 10`
#> To state 1 To state 2
#> From state 1 0.960 0.040
#> From state 2 0.068 0.932
```

An additional option offered by the functions `obtain_gamma` and `obtain_emiss` is changing the burn-in period used for obtaining the summary statistics, via the input variable `burn_in`.
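For example, to discard the first 500 instead of 200 iterations when obtaining the summary statistics of the transition probabilities:

```
gamma_pop_500 <- obtain_gamma(out_2st, burn_in = 500)
```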

The package includes several plot functions to display the fitted model and its parameters. First, one can plot the posterior densities of a fitted model, both for the transition probability matrix gamma and for the emission distribution probabilities. The posterior densities are plotted for the group level and the subject level simultaneously. For example, for the emission distribution of the variable `p_vocalizing`:

```
library(RColorBrewer)
Voc_col <- c(brewer.pal(3, "PuBuGn")[c(1, 3, 2)])
Voc_lab <- c("Not Speaking", "Speaking", "Back channeling")
plot(out_2st, component = "emiss", dep = 1, col = Voc_col,
     dep_lab = c("Patient vocalizing"), cat_lab = Voc_lab)
```

Here, `component` specifies whether we want to visualize the posterior densities of the transition probability matrix gamma (`component = "gamma"`) or of the emission distribution probabilities (`component = "emiss"`). When using `component = "emiss"`, the input variable `dep` specifies which dependent variable we want to inspect (as the variable `p_vocalizing` is the first variable in the set, we set `dep = 1`). Furthermore, `col` specifies the colors used for the plotted lines, `dep_lab` denotes the label of the dependent variable we are plotting, and `cat_lab` denotes the labels of the categorical outcomes of the dependent variable. In the plot, the solid line visualizes the posterior density at the group level, while each of the dotted lines visualizes the posterior density of one subject.
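Analogously, the posterior densities of the transition probabilities can be plotted by setting `component = "gamma"`; a minimal call, relying on the function's defaults for the remaining arguments:

```
plot(out_2st, component = "gamma")
```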

Second, one can plot the transition probabilities obtained with the function `obtain_gamma` in a riverplot:

```
# Transition probabilities at the group level and for subject number 1, respectively:
plot(gamma_pop, col = rep(rev(brewer.pal(3,"PiYG"))[-2], each = m))
plot(gamma_subj, subj_nr = 1, col = rep(rev(brewer.pal(3,"PiYG"))[-2], each = m))
```

Note that graphically displaying the transition probabilities becomes more informative as the number of states increases.

Given a well-fitting HMM, it may be of interest to determine the actual *sequence*, or order of succession, of hidden states that has most likely given rise to the sequence of outcomes observed in a subject. One can either use local decoding, in which the probabilities of the hidden state sequence are obtained simultaneously with the model parameter estimates, or the well-known Viterbi algorithm (Viterbi 1967; Forney Jr 1973). In local decoding, the most likely state is determined separately at each time point \(t\), in contrast to the Viterbi algorithm, in which one determines the joint probability of the complete sequence of observations \(O_{1:T}\) and the complete sequence of hidden states \(S_{1:T}\).

In the package, local decoding can be achieved by saving the sampled hidden state sequence at each iteration of the Gibbs sampler, by setting the input variable `return_path = TRUE` in the function `mHMM`. Note, however, that this results in very large output files. Global decoding can be performed using the function `vit_mHMM`:

```
state_seq <- vit_mHMM(out_2st, s_data = nonverbal)
head(state_seq)
#> Subj_1 Subj_2 Subj_3 Subj_4 Subj_5 Subj_6 Subj_7 Subj_8 Subj_9
#> [1,] 1 2 2 2 1 2 1 1 1
#> [2,] 1 2 2 2 1 2 1 1 1
#> [3,] 1 2 2 2 2 2 1 1 1
#> [4,] 1 2 2 2 2 2 1 1 1
#> [5,] 1 2 2 2 2 2 2 1 1
#> [6,] 1 2 2 2 2 1 2 1 1
#> Subj_10
#> [1,] 1
#> [2,] 1
#> [3,] 1
#> [4,] 1
#> [5,] 1
#> [6,] 1
```

The function returns the hidden state sequence of each subject in a matrix, where each row represents a point in time and each column represents a subject. We can inspect the obtained hidden state sequence by, for example, plotting it together with the observed data. Below, the first 5 minutes of the first couple are plotted again, with the addition of the estimated state sequence:
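The plotting code itself is not part of the package; a minimal sketch that displays just the decoded state sequence for the first 5 minutes of the first couple could be:

```
# Decoded states of couple 1 over the first 300 seconds, as a step function.
plot(1:300, state_seq[1:300, 1], type = "s", yaxt = "n",
     xlab = "Time (seconds)", ylab = "", main = "Decoded states, couple 1")
axis(2, at = 1:2, labels = c("State 1", "State 2"), las = 2)
```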

When using Bayesian estimation procedures, it is strongly advised to check model convergence and label switching. That is, one should check whether the algorithm reaches the same solution when a set of different (but often conceptually similar) starting values is used, and whether label switching is not a problem. With label switching, the labels of the states (i.e., which state represents what) switch over the iterations of the estimation algorithm; for example, what started out as state 1 becomes state 2. One can check model convergence and label switching visually by inspecting the trace plots of the parameters of a set of identical models that used varying starting values. Trace plots are plots of the sampled parameter values over the iterations. First, we fit the model with 2 states again, but with different starting values:

```
# specifying general model properties
m <- 2
n_dep <- 4
q_emiss <- c(3, 2, 3, 2)
# specifying different starting values
start_TM <- diag(.8, m)
start_TM[lower.tri(start_TM) | upper.tri(start_TM)] <- .2
start_EM_b <- list(matrix(c(0.2, 0.6, 0.2,
                            0.6, 0.2, 0.2), byrow = TRUE,
                          nrow = m, ncol = q_emiss[1]), # vocalizing patient
                   matrix(c(0.4, 0.6,
                            0.4, 0.6), byrow = TRUE, nrow = m,
                          ncol = q_emiss[2]), # looking patient
                   matrix(c(0.6, 0.2, 0.2,
                            0.2, 0.6, 0.2), byrow = TRUE,
                          nrow = m, ncol = q_emiss[3]), # vocalizing therapist
                   matrix(c(0.4, 0.6,
                            0.4, 0.6), byrow = TRUE, nrow = m,
                          ncol = q_emiss[4])) # looking therapist
# Run a model identical to out_2st, but with different starting values:
set.seed(9843)
out_2st_b <- mHMM(s_data = nonverbal,
                  gen = list(m = m, n_dep = n_dep, q_emiss = q_emiss),
                  start_val = c(list(start_TM), start_EM_b),
                  mcmc = list(J = 1000, burn_in = 200))
```

The group level parameter estimates of the emission probabilities and the transition probability matrix at each iteration of the estimation algorithm are stored in the objects `emiss_prob_bar` and `gamma_prob_bar`, respectively. The subject level parameter estimates are stored in the object `PD_subj`, where PD is an abbreviation of posterior density. If we, for example, want to inspect the trace plots of the emission probabilities for the looking behavior of the patient at the group level, we can use the following code:

```
# category labels of the looking variables (1 = not looking, 2 = looking):
Look_lab <- c("Not looking", "Looking")
par(mfrow = c(m, q_emiss[2]))
for(i in 1:m){
  for(q in 1:q_emiss[2]){
    plot(x = 1:1000, y = out_2st$emiss_prob_bar[[2]][, (i-1) * q_emiss[2] + q],
         ylim = c(0, 1.4), yaxt = 'n', type = "l", ylab = "Emission probability",
         xlab = "Iteration", main = paste("Patient", Look_lab[q], "in state", i),
         col = "#8da0cb")
    axis(2, at = seq(0, 1, .2), las = 2)
    lines(x = 1:1000, y = out_2st_b$emiss_prob_bar[[2]][, (i-1) * q_emiss[2] + q],
          col = "#e78ac3")
    legend("topright", col = c("#8da0cb", "#e78ac3"), lwd = 2,
           legend = c("Starting value set 1", "Starting value set 2"), bty = "n")
  }
}
```

It can be observed that the parameter estimates of both chains converge to the same region of the parameter space and that the chains mix well. Also, there is no evidence of label switching.
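Visual inspection can be complemented by a numerical check. A sketch using the Gelman-Rubin diagnostic from the `coda` package on the same group level emission probabilities, assuming the column layout of `emiss_prob_bar` used in the trace plot code above:

```
library(coda)
burn_in <- 200
# Potential scale reduction factors for the emission probabilities of
# patient looking behavior, comparing the two chains after burn-in.
chain_1 <- mcmc(out_2st$emiss_prob_bar[[2]][(burn_in + 1):1000, ])
chain_2 <- mcmc(out_2st_b$emiss_prob_bar[[2]][(burn_in + 1):1000, ])
gelman.diag(mcmc.list(chain_1, chain_2), multivariate = FALSE)
```

Values close to 1 indicate that both chains sample from the same distribution.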

Altman, Rachel MacKay. 2007. “Mixed Hidden Markov Models: An Extension of the Hidden Markov Model to the Longitudinal Data Setting.” *Journal of the American Statistical Association* 102 (477). Taylor & Francis: 201–10.

Burge, Christopher B, and Samuel Karlin. 1998. “Finding the Genes in Genomic DNA.” *Current Opinion in Structural Biology* 8 (3). Elsevier: 346–54.

Cappé, Olivier, Eric Moulines, and Tobias Rydén. 2005. *Inference in Hidden Markov Models*. New York: Springer.

Ephraim, Yariv, and Neri Merhav. 2002. “Hidden Markov Processes.” *Information Theory, IEEE Transactions on* 48 (6). IEEE: 1518–69.

Forney Jr, G David. 1973. “The Viterbi Algorithm.” *Proceedings of the IEEE* 61 (3). IEEE: 268–78.

Haan-Rietdijk, S de, Peter Kuppens, Cindy S Bergeman, LB Sheeber, NB Allen, and EL Hamaker. 2017. “On the Use of Mixed Markov Models for Intensive Longitudinal Data.” *Multivariate Behavioral Research* 52 (6). Taylor & Francis: 747–67.

Henderson, John, Steven Salzberg, and Kenneth H Fasman. 1997. “Finding Genes in DNA with a Hidden Markov Model.” *Journal of Computational Biology* 4 (2): 127–41.

Krogh, Anders, I Saira Mian, and David Haussler. 1994. “A Hidden Markov Model That Finds Genes in E. coli DNA.” *Nucleic Acids Research* 22 (22). Oxford University Press: 4768–78.

Rabiner, Lawrence R. 1989. “A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition.” *Proceedings of the IEEE* 77 (2). IEEE: 257–86.

Rueda, Oscar M, Cristina Rueda, and Ramon Diaz-Uriarte. 2013. “A Bayesian HMM with Random Effects and an Unknown Number of States for DNA Copy Number Analysis.” *Journal of Statistical Computation and Simulation* 83 (1). Taylor & Francis: 82–96.

Rydén, Tobias. 2008. “EM Versus Markov Chain Monte Carlo for Estimation of Hidden Markov Models: A Computational Perspective.” *Bayesian Analysis* 3 (4). International Society for Bayesian Analysis: 659–88.

Scott, Steven L. 2002. “Bayesian Methods for Hidden Markov Models.” *Journal of the American Statistical Association* 97 (457).

Shirley, Kenneth E, Dylan S Small, Kevin G Lynch, Stephen A Maisto, and David W Oslin. 2010. “Hidden Markov Models for Alcoholism Treatment Trial Data.” *The Annals of Applied Statistics*. JSTOR, 366–95.

Viterbi, Andrew J. 1967. “Error Bounds for Convolutional Codes and an Asymptotically Optimum Decoding Algorithm.” *Information Theory, IEEE Transactions on* 13 (2). IEEE: 260–69.

Woodland, Philip C, and Daniel Povey. 2002. “Large Scale Discriminative Training of Hidden Markov Models for Speech Recognition.” *Computer Speech & Language* 16 (1). Elsevier: 25–47.

Zhang, Yue, and Kiros Berhane. 2014. “Bayesian Mixed Hidden Markov Models: A Multi-Level Approach to Modeling Categorical Outcomes with Differential Misclassification.” *Statistics in Medicine* 33 (8). Wiley Online Library: 1395–1408.

Zucchini, Walter, Iain L MacDonald, and Roland Langrock. 2016. *Hidden Markov Models for Time Series: An Introduction Using R*. Boca Raton: CRC Press.

Note that the likelihood ratio test, commonly used to compare nested models, cannot be used in the case of the HMM (i.e., the difference in the log-likelihoods between models is not \(\chi^2\) distributed; Rydén 2008).

We note, however, that the AIC approximates the posterior distribution of the parameters by a Gaussian distribution, which might not be appropriate for models including parameters on the boundary of the parameter space (e.g., close to 0 or 1 in the case of probability estimates), or for small data sets, as exemplified by Scott (2002). Model selection is therefore not a straightforward procedure in the context of HMMs, and the choices made remain subjective.