Generalized Poisson Hidden Markov Model for Overdispersed or Underdispersed Count Data

Hidden Markov Models (HMMs) provide a natural statistical framework for capturing serial dependence in time series of count data. These models assume that the observations are generated from a finite mixture of distributions governed by an underlying Markov chain (MC). The Poisson Hidden Markov Model (P-HMM) is perhaps the most widely used model in this setting. In practice, however, it is often not the best choice, because count data frequently violate the equidispersion assumption of the Poisson distribution. Taking this fact into account, we adopt the Generalized Poisson Distribution (GPD) for modelling count data, since it can accommodate both the overdispersion and the underdispersion that the Poisson model cannot. We develop the Generalized Poisson Hidden Markov Model (GP-HMM) by combining the GPD with an HMM. Results on simulated data and on a real application, monthly cases of Leptospirosis in the state of Kerala in South India, show good convergence properties and indicate that the GP-HMM outperforms the P-HMM.


Introduction
The Poisson model is the most commonly used method for modelling time series of counts. Although equidispersion (equality of mean and variance) is a defining feature of the Poisson distribution, in practice the variance often exceeds the mean (overdispersion) or falls below it (underdispersion), making the Poisson assumption inappropriate. In many populations of Poisson nature, the probability of the occurrence of an event does not remain constant and is affected by previous occurrences, resulting in unequal mean and variance in the data (Kendall & Stuart 1963). To deal with such situations, modifications and generalizations of the Poisson distribution were considered by Greenwood & Yule (1920) and by Neyman (1931). Wang & Famoye (1997) introduced generalized Poisson regression for modelling household fertility decisions. More recently, Cepeda-Cuervo & Cifuentes-Amado (2017) developed mean and dispersion regression models for fitting overdispersed data based on the beta binomial and negative binomial models.
An important generalization of the Poisson distribution was introduced by Consul & Jain (1973) with two parameters λ_1 and λ_2; it can be obtained as a limiting form of the generalized negative binomial distribution. Consul & Shoukri (1984) obtained maximum likelihood estimators of the parameters of the GPD, and its properties are discussed by Consul (1989) and Tuenter (2000). The variance of the GPD is greater than, equal to, or less than the mean according as the second parameter λ_2 is positive, zero, or negative. Both the mean and the variance increase or decrease with λ_1. When λ_2 is positive, both the mean and the variance increase in value, but the variance increases faster than the mean, which results in overdispersion; a negative λ_2 analogously produces underdispersion. The probability mass function of the GPD is

P(X = x) = \frac{\lambda_1 (\lambda_1 + \lambda_2 x)^{x-1} e^{-(\lambda_1 + \lambda_2 x)}}{x!}, \quad x = 0, 1, 2, \ldots,

where λ_1 > 0 and |λ_2| < 1. The mean and variance of the GPD are

E(X) = \frac{\lambda_1}{1 - \lambda_2}, \qquad Var(X) = \frac{\lambda_1}{(1 - \lambda_2)^3}.

The GPD is widely used for modelling data in many situations. It can adjust for overdispersion in the Poisson model, as the negative binomial model does, and it is also apt for modelling underdispersed Poisson data. On these grounds, the GP-HMM can be considered a better option than the P-HMM for count data modelling, as shown in Sebastian, Jeyaseelan, Jeyaseelan, Anandan, George & Bangdiwala (2019). The idea of using the GPD in an HMM is not new: Witowski, Foraita, Pitsiladis, Pigeot & Wirsik (2014) carried out a simulation study using HMMs to improve the quantification of physical activity in accelerometer data, comparing the P-HMM, the GP-HMM and the normal-HMM. The rest of this paper is organized into four sections, detailing the methods and parameter estimation of the GP-HMM, a simulation study and a real data application.
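The pmf and moments above translate directly into code. The following Python sketch (ours, not part of the original paper) evaluates the GPD pmf, mean and variance, illustrating how the sign of λ_2 controls the dispersion:

```python
import math

def gpd_pmf(x, lam1, lam2):
    """GPD pmf of Consul & Jain (1973), with lam1 > 0 and |lam2| < 1."""
    t = lam1 + lam2 * x
    if t <= 0:  # outside the support when lam2 is negative
        return 0.0
    return lam1 * t ** (x - 1) * math.exp(-t) / math.factorial(x)

def gpd_mean(lam1, lam2):
    return lam1 / (1.0 - lam2)

def gpd_var(lam1, lam2):
    return lam1 / (1.0 - lam2) ** 3

# lam2 > 0 gives variance > mean (overdispersion),
# lam2 < 0 gives variance < mean (underdispersion),
# lam2 = 0 recovers the ordinary Poisson(lam1) distribution.
```

Setting lam2 = 0 reduces the pmf term-by-term to lam1**x * exp(-lam1) / x!, the Poisson pmf, which is a quick sanity check on the implementation.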

Generalized Poisson Hidden Markov model
In an HMM, there is an underlying unobserved state of the system that evolves over time according to a Markov process. The distribution of the observation at a given time is determined by the system's state at that time (Zucchini & MacDonald 2009). Let {H_t, t = 1, 2, \ldots, T} be an MC on a finite state space S = {1, 2, \ldots, m} with transition probability matrix A = (a_{ij}), where a_{ij} = Pr[H_{t+1} = j | H_t = i] for any i, j ∈ S, and with initial distribution π = (π_1, π_2, \ldots, π_m)′, where π_i = Pr[H_1 = i] for any i ∈ S. The MC H_t, defined on a finite state space, is assumed homogeneous and irreducible, so the initial distribution π is stationary and satisfies π′ = π′A.
Let {X_t, t ∈ N} be an HMM, a particular type of dependent mixture. Writing X^{(t)} = (X_1, \ldots, X_t) and H^{(t)} = (H_1, \ldots, H_t), the model is defined by the following two properties:

Pr(H_t | H^{(t-1)}) = Pr(H_t | H_{t-1}), \quad t = 2, 3, \ldots,
Pr(X_t | X^{(t-1)}, H^{(t)}) = Pr(X_t | H_t), \quad t ∈ N.

The first property states that the 'parameter process' H_t, t = 1, 2, \ldots satisfies the Markov property, while the second states that for the 'state-dependent process' X_t, t = 1, 2, \ldots, the distribution of X_t depends only on the current state H_t and not on previous states or observations. We now introduce some notation: the probability mass function (pmf) of X_t, given that the Markov chain is in state i at time t, is denoted by p_i and is called the state-dependent distribution of the model.
In the case of the GPD, the state-dependent pmf is

p_i(x) = \frac{\lambda_{1i} (\lambda_{1i} + \lambda_{2i} x)^{x-1} e^{-(\lambda_{1i} + \lambda_{2i} x)}}{x!}, \quad x = 0, 1, 2, \ldots

Let π_i(t) = Pr(H_t = i) denote the unconditional probability of the MC being in state i at time t. For the discrete-valued random variable X_t,

Pr(X_t = x) = \sum_{i=1}^{m} Pr(H_t = i) \, Pr(X_t = x | H_t = i) = \sum_{i=1}^{m} \pi_i(t) \, p_i(x).

This expression can be rewritten in matrix form as

Pr(X_t = x) = \pi(1)' A^{t-1} P(x) 1',

where P(x) is the diagonal matrix with i-th diagonal element p_i(x) and 1 is a vector of ones. If the MC is stationary, then π(1)′A^{t-1} = π′ and, in this case,

Pr(X_t = x) = \pi' P(x) 1'.

In the case of a GP-HMM, the mean and variance of X_t are

E(X_t) = \sum_{i=1}^{m} \pi_i \mu_i, \qquad Var(X_t) = \sum_{i=1}^{m} \pi_i (\sigma_i^2 + \mu_i^2) - \Big( \sum_{i=1}^{m} \pi_i \mu_i \Big)^2,

where μ_i = λ_{1i}/(1 − λ_{2i}) and σ_i^2 = λ_{1i}/(1 − λ_{2i})^3. So the model {X_t, H_t}, t = 1, 2, \ldots, T is characterized by (i) the stationary initial distribution π, (ii) the transition probability matrix A and (iii) the state-dependent pmfs p_i(x). Let Φ = (Π, A, Θ) denote the parameter space. The parameters to be estimated are the (m^2 − m) transition probabilities a_{ij}, i = 1, 2, \ldots, m, j = 1, 2, \ldots, m − 1; the m entries of the vector Π; and the 2m parameters λ_{1i} and λ_{2i} of the GP state-dependent distributions. Hence, the vector of unknown parameters is

\phi = (a_{11}, \ldots, a_{1,m-1}, \ldots, a_{m1}, \ldots, a_{m,m-1}, \lambda_{11}, \lambda_{12}, \ldots, \lambda_{1m}, \lambda_{21}, \lambda_{22}, \ldots, \lambda_{2m})',

which belongs to the parameter space Φ. The initial distribution π is estimated from the equality π′ = π′A after the estimation of the matrix A. The estimators of the vector φ are obtained by the EM algorithm. The likelihood function is given by

L_T = \pi' P(x_1) A P(x_2) \cdots A P(x_T) 1'.  (2)
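To make these quantities concrete, the short Python sketch below (ours; the parameter values are illustrative, not estimates from the paper) computes the stationary distribution π from a transition matrix A via π′ = π′A, and the marginal mean and variance of X_t for a two-state GP-HMM:

```python
import numpy as np

def stationary_dist(A):
    """Solve pi' = pi' A together with the constraint sum(pi) = 1."""
    m = A.shape[0]
    M = np.vstack([A.T - np.eye(m), np.ones(m)])  # stack normalization row
    b = np.zeros(m + 1)
    b[-1] = 1.0
    pi, *_ = np.linalg.lstsq(M, b, rcond=None)
    return pi

# Illustrative two-state example
A = np.array([[0.9, 0.1],
              [0.2, 0.8]])
lam1 = np.array([10.0, 30.0])   # state-wise lambda_{1i}
lam2 = np.array([0.7, 0.6])     # state-wise lambda_{2i}

pi = stationary_dist(A)              # here pi = (2/3, 1/3)
mu = lam1 / (1.0 - lam2)             # state-dependent means
sig2 = lam1 / (1.0 - lam2) ** 3      # state-dependent variances
mean_x = pi @ mu                     # E(X_t)
var_x = pi @ (sig2 + mu ** 2) - mean_x ** 2  # Var(X_t)
```

The marginal variance exceeds the marginal mean here both because each state-dependent GPD is overdispersed (λ_{2i} > 0) and because of the between-state spread of the means.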

Joint Probability Mass Functions of the Model
For an HMM {X_t; H_t} with observations x_1, x_2, \ldots, x_T and Markov-chain states h_1, h_2, \ldots, h_T, the joint pmf is

Pr(X^{(T)} = x^{(T)}, H^{(T)} = h^{(T)}) = \pi_{h_1} p_{h_1}(x_1) \prod_{t=2}^{T} a_{h_{t-1} h_t} p_{h_t}(x_t).

Summing over h_1, h_2, \ldots, h_T gives the joint pmf of the observations. Defining P(x) as the diagonal matrix with i-th diagonal element p_i(x), we have

Pr(X^{(T)} = x^{(T)}) = \pi' P(x_1) A P(x_2) \cdots A P(x_T) 1'.

When π is stationary, it can be replaced by π′A, and the joint pmf becomes

Pr(X^{(T)} = x^{(T)}) = \pi' A P(x_1) A P(x_2) \cdots A P(x_T) 1'.

If π is not stationary, the initial state of the Markov chain has to be estimated, and estimating the initial distribution from a single observation at time 1 is not useful. If π is stationary, it is fully determined by the transition probabilities (Zucchini & MacDonald 2009).
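The matrix-product form of the likelihood can be evaluated directly. The Python sketch below (ours, not the paper's implementation) computes L_T = π′ P(x_1) A P(x_2) ⋯ A P(x_T) 1′ for a GP-HMM, with the GPD pmf evaluated on the log scale to avoid overflow for large counts:

```python
import math
import numpy as np

def gpd_pmf(x, lam1, lam2):
    """GPD pmf, computed via logs for numerical safety."""
    t = lam1 + lam2 * x
    if t <= 0:
        return 0.0
    return math.exp(math.log(lam1) + (x - 1) * math.log(t) - t
                    - math.lgamma(x + 1))

def gp_hmm_likelihood(xs, pi, A, lam1, lam2):
    """L_T = pi' P(x1) A P(x2) ... A P(xT) 1' for a GP-HMM."""
    m = len(pi)
    Px = lambda x: np.diag([gpd_pmf(x, lam1[i], lam2[i]) for i in range(m)])
    phi = pi @ Px(xs[0])       # row vector pi' P(x1)
    for x in xs[1:]:
        phi = phi @ A @ Px(x)
    return phi.sum()           # right-multiplication by the vector of ones
```

For long series this raw product underflows, so in practice each factor is rescaled at every step; the scaled recursion is exactly the forward algorithm used for estimation.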

Estimation by EM Algorithm
A commonly used method of fitting HMMs is the EM algorithm, introduced by Dempster, Laird & Rubin (1977) for the computation of maximum likelihood estimates based on incomplete data; Pereira, Marques & da Costa (2012) studied the performance of EM estimates for mixture models. Here, we use this tool for the estimation of the parameters of the HMM via forward and backward probabilities (Baum 1972), which are also used for decoding and state prediction. For t = 1, 2, \ldots, T, define the (row) vector

\alpha_t = \pi' P(x_1) A P(x_2) \cdots A P(x_t),

with π denoting the initial distribution of the MC. The elements of α_t are called forward probabilities. For t = 1, 2, \ldots, T and j = 1, 2, \ldots, m, the j-th component of α_t is the joint probability

\alpha_t(j) = Pr(X^{(t)} = x^{(t)}, H_t = j).

The vector of backward probabilities β_t, for t = 1, 2, \ldots, T, is defined by

\beta_t' = A P(x_{t+1}) A P(x_{t+2}) \cdots A P(x_T) 1',

with β_T = 1. For t = 1, 2, \ldots, T − 1 and j = 1, 2, \ldots, m, the j-th component of β_t is the conditional probability

\beta_t(j) = Pr(X_{t+1}^{T} = x_{t+1}^{T} | H_t = j),

where X_a^b denotes the vector (X_a, X_{a+1}, \ldots, X_b). For t = 1, 2, \ldots, T and i = 1, 2, \ldots, m,

\alpha_t(i) \beta_t(i) = Pr(X^{(T)} = x^{(T)}, H_t = i),

and consequently α_t β_t′ = Pr(X^{(T)} = x^{(T)}) = L_T for each t. In HMMs, this property is used in the EM algorithm and in local decoding. For t = 1, 2, \ldots, T, firstly

Pr(H_t = j | X^{(T)} = x^{(T)}) = \frac{\alpha_t(j) \beta_t(j)}{L_T},

and secondly, for t = 2, 3, \ldots, T,

Pr(H_{t-1} = j, H_t = k | X^{(T)} = x^{(T)}) = \frac{\alpha_{t-1}(j) \, a_{jk} \, p_k(x_t) \, \beta_t(k)}{L_T}.

The EM algorithm is an iterative method for performing maximum likelihood estimation when some data are missing. Here, the complete-data log-likelihood (CDLL), that is, the log-likelihood of the parameters of interest based on both the observed data and the missing data, is to be maximized. The algorithm is started by choosing values for the parameters Θ. Then, the following steps are repeated: • E step: Compute the conditional expectations of those functions of the missing data that appear in the CDLL.
• M step: Maximize, with respect to Θ, CDLL with the functions of the missing data replaced by their conditional expectations.
The sequence of states h_1, h_2, \ldots, h_T followed by the MC is described by the zero-one random variables

u_j(t) = 1 \text{ if and only if } h_t = j, \qquad v_{jk}(t) = 1 \text{ if and only if } h_{t-1} = j \text{ and } h_t = k.

The log-likelihood of the observations x_1, x_2, \ldots, x_T and the missing data h_1, h_2, \ldots, h_T, i.e. the CDLL, is then

\log Pr(x^{(T)}, h^{(T)}) = \sum_{j=1}^{m} u_j(1) \log \pi_j + \sum_{j=1}^{m} \sum_{k=1}^{m} \Big( \sum_{t=2}^{T} v_{jk}(t) \Big) \log a_{jk} + \sum_{j=1}^{m} \sum_{t=1}^{T} u_j(t) \log p_j(x_t).

Here, π is to be understood as the initial distribution of the MC (the distribution of H_1), not necessarily the stationary distribution. Of course, it is not reasonable to try to estimate the initial distribution from just one observation at time 1, especially as the state of the MC itself is not observed. The EM algorithm for HMMs proceeds as follows: • E step: Replace the quantities v_{jk}(t) and u_j(t) by their conditional expectations given the observations x^{(T)}:

\hat{u}_j(t) = \frac{\alpha_t(j) \beta_t(j)}{L_T}, \qquad \hat{v}_{jk}(t) = \frac{\alpha_{t-1}(j) \, a_{jk} \, p_k(x_t) \, \beta_t(k)}{L_T}.

• M step: Replace v_{jk}(t) and u_j(t) by \hat{v}_{jk}(t) and \hat{u}_j(t), and maximize the CDLL with respect to the three sets of parameters: the initial distribution π, the transition probability matrix A and the parameters of the state-dependent distributions λ_{11}, λ_{12}, \ldots, λ_{1m}, λ_{21}, λ_{22}, \ldots, λ_{2m}.
The M step splits neatly into three separate maximizations: term 1 depends only on the initial distribution π, term 2 only on the transition probability matrix A, and term 3 only on the state-dependent parameters of the GPD, which are estimated by numerical maximization (Zucchini & MacDonald 2009).
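As an illustration of the E step, the following Python sketch (ours; the paper's own computations use the R package HMMpa) runs a scaled forward-backward pass and returns the conditional expectations û_j(t) = Pr(H_t = j | x^{(T)}) together with the log-likelihood:

```python
import math
import numpy as np

def gpd_pmf(x, lam1, lam2):
    t = lam1 + lam2 * x
    if t <= 0:
        return 0.0
    return math.exp(math.log(lam1) + (x - 1) * math.log(t) - t
                    - math.lgamma(x + 1))

def e_step_state_probs(xs, pi, A, lam1, lam2):
    """Scaled forward-backward pass.

    Returns (u_hat, loglik), where u_hat[t, j] = Pr(H_t = j | x^(T)).
    """
    T, m = len(xs), len(pi)
    B = np.array([[gpd_pmf(x, lam1[j], lam2[j]) for j in range(m)]
                  for x in xs])            # state-dependent pmfs p_j(x_t)
    alpha = np.zeros((T, m))
    beta = np.ones((T, m))
    c = np.zeros(T)                        # per-step scale factors
    alpha[0] = pi * B[0]
    c[0] = alpha[0].sum()
    alpha[0] /= c[0]
    for t in range(1, T):                  # forward recursion
        alpha[t] = (alpha[t - 1] @ A) * B[t]
        c[t] = alpha[t].sum()
        alpha[t] /= c[t]
    for t in range(T - 2, -1, -1):         # backward recursion
        beta[t] = (A @ (B[t + 1] * beta[t + 1])) / c[t + 1]
    u_hat = alpha * beta                   # proportional to alpha_t(j) beta_t(j)
    u_hat /= u_hat.sum(axis=1, keepdims=True)
    return u_hat, float(np.log(c).sum())
```

The pairwise expectations v̂_jk(t) are obtained from the same α and β arrays; the M step then re-estimates A in closed form and the GPD parameters by numerical maximization, as described above.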

A Simulation Study
This section presents the results of a simulation study of the performance of the GP-HMM. We assessed the estimates using the mean squared error (MSE) on the simulated data. The simulations were repeated for sample sizes ranging from 500 to 10,000 and for different parameter values. Before examining the parameter estimates, we ran the EM algorithm with m = 2, \ldots, 6 states to select the appropriate number of states of the GP-HMM, computing the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) for each model. Both AIC and BIC were lowest for m = 2 states, so the two-state GP-HMM was used for the subsequent computations. The initial values of the parameters were a_11 = 0.9, a_12 = 0.1, λ_11 = 10, λ_12 = 30, λ_21 = 0.7, and λ_22 = 0.6.
We ran 50 replications of the EM algorithm on the simulated data and computed the MSE and biases (given in brackets); the results are shown in Table 1. The MSE and biases of the parameter estimates tend towards zero as the sample size increases, and the mean estimates of the parameters approach the true parameter values.
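The simulation design can be reproduced along the following lines. The Python sketch below (ours; the paper's computations were done in R with HMMpa) draws a sample path from a two-state GP-HMM with the initial parameter values listed above, sampling the GPD by inversion of its cumulative pmf:

```python
import math
import numpy as np

def gpd_pmf(x, lam1, lam2):
    t = lam1 + lam2 * x
    if t <= 0:
        return 0.0
    return math.exp(math.log(lam1) + (x - 1) * math.log(t) - t
                    - math.lgamma(x + 1))

def sample_gpd(rng, lam1, lam2, max_x=1000):
    """Inversion sampling from the GPD via its cumulative pmf."""
    u = rng.random()
    cdf = 0.0
    for x in range(max_x):
        cdf += gpd_pmf(x, lam1, lam2)
        if u < cdf:
            return x
    return max_x  # truncation guard, practically unreachable here

def simulate_gp_hmm(T, pi, A, lam1, lam2, seed=0):
    rng = np.random.default_rng(seed)
    m = len(pi)
    h = rng.choice(m, p=pi)           # initial state drawn from pi
    xs = np.empty(T, dtype=int)
    for t in range(T):
        xs[t] = sample_gpd(rng, lam1[h], lam2[h])
        h = rng.choice(m, p=A[h])     # transition of the hidden chain
    return xs

# Parameter values as in the simulation study
A = np.array([[0.9, 0.1], [0.2, 0.8]])
pi = np.array([2 / 3, 1 / 3])         # stationary distribution of this A
lam1, lam2 = [10.0, 30.0], [0.7, 0.6]
xs = simulate_gp_hmm(500, pi, A, lam1, lam2)
```

A study like the one reported in Table 1 then fits the GP-HMM to many such replicated paths and records the MSE and bias of each parameter estimate across replications.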

Real Data Application
The GP-HMM is applied to a real-life data set: a series of monthly Leptospirosis incidence counts in the state of Kerala in South India during the 2006-2017 period. The data are sourced from the official website of the Directorate of Health Services, Government of Kerala, India. A graphical representation of the data, with 144 time points, is shown in Figure 1. In total, 14,460 cases, with mean 100.42 and variance 3,342.41, are studied. The sample variance is much greater than the sample mean, so the data are clearly overdispersed. February 2014 recorded the minimum of 20 Leptospirosis cases, while the maximum of 292 was reported in September 2008. Two models, the HMM with Poisson state-dependent distributions and the GP-HMM, were estimated and compared on the basis of −log L, AIC and BIC; Table 2 presents the comparison. In this table, k represents the number of parameters: for the P-HMM, k = m^2, and for the GP-HMM, k = m(m + 1). The model with the lowest BIC is preferred, and by this criterion the two-state GP-HMM is the most appropriate. Figure 1 also shows the result of fitting a four-state P-HMM and a two-state GP-HMM to the Leptospirosis series using the EM estimates. The four-state model is fitted using an HMM with Poisson state-dependent distributions, while the two-state model uses the GPD as the state-dependent distribution. On close analysis of the results, the GP-HMM fits the data better. The estimated transition probability matrices for the four-state P-HMM and the two-state GP-HMM were also obtained. The whole analysis is done using the R software; the package HMMpa (Witowski & Foraita 2013) is used for the modelling.
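The comparison in Table 2 rests on a few simple quantities. A minimal Python sketch of these formulas (ours; the parameter counts follow the rules stated above, and no fitted log-likelihood values from the paper are reproduced):

```python
import math

def aic(neg_loglik, k):
    """AIC = 2k + 2(-log L); smaller is better."""
    return 2.0 * k + 2.0 * neg_loglik

def bic(neg_loglik, k, T):
    """BIC = k log(T) + 2(-log L); smaller is better."""
    return k * math.log(T) + 2.0 * neg_loglik

def n_params_p_hmm(m):
    return m * m            # (m^2 - m) transition probs + m Poisson means

def n_params_gp_hmm(m):
    return m * (m + 1)      # (m^2 - m) transition probs + 2m GPD parameters

T = 144  # monthly observations, 2006-2017
```

Because log(144) > 2, the BIC penalizes each extra parameter more heavily than the AIC, which is why model selection here leans on the BIC.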

Discussion and Conclusion
We propose to deal with overdispersion and underdispersion in time series of counts by introducing the GPD into the HMM. Here, the EM algorithm is used for the estimation of the parameters, implemented with the R package HMMpa (Witowski & Foraita 2013), and the validity of the estimates is verified through a simulation study. We use an original data set, the monthly count of Leptospirosis cases in Kerala between 2006 and 2017, to show that the GP-HMM performs much better than the P-HMM. Leptospirosis, according to the World Health Organization, is a bacterial disease found in places that experience heavy rainfall and flooding. It is transmitted to humans through cuts in the skin, or through the mucous membranes of the eyes, nose and mouth, upon contact with water contaminated with the urine of infected animals. In this study, the occurrences of the disease are modelled with an HMM because of the serial dependence in the Leptospirosis data. The P-HMM is the most widely used method for modelling such data; here, we develop the GP-HMM to fit the Leptospirosis series and compare it with the P-HMM using AIC and BIC. These findings show that the GP-HMM performs considerably better than the P-HMM.
For modelling overdispersed count data, a suitable model is a mixture of Poissons. Both the negative binomial distribution (NBD) and the GPD are in fact mixtures of Poisson models, where the mixing distributions are continuous. The mixing distribution in the case of the NBD is a gamma distribution, and the NBD is suitable for modelling data with excess zeroes (Joe & Zhu 2005). The GPD, however, is better suited when more mass is concentrated in the tail of the distribution, and it can also handle both overdispersed and underdispersed data. This may explain the better fit of the GP-HMM for the data in our example. When excess zeroes and a heavy tail are both present, a zero-inflated GP-HMM may be used for modelling serially correlated counts; we will concentrate on this aspect in further study. The results of our simulation study and example should help both theoreticians and practitioners in making inference on unequal mean-variance scenarios in serially correlated time series of counts.

The authors thank the editors and referees for their valuable feedback and comments, which helped us improve the manuscript.