Convergence Theorems in Multinomial Saturated and Logistic Models

In this paper, we develop a theoretical study of the saturated and logistic multinomial models when the response variable takes one of R ≥ 2 levels. Several theorems on the existence and calculation of the maximum likelihood (ML) estimates of the parameters of both models are presented and proved. Furthermore, properties of the score vectors and information matrices of both models are identified and, based on asymptotic theory, convergence theorems for them are proved. Finally, an application of this theory is presented and assessed using data available in the R statistical program.


Introduction
Logistic regression models are generalized linear models used for classifying individuals (Cox 1958); they allow continuously and categorically scaled covariates to predict any categorically scaled outcome (Darlington 1990). These models make no assumptions about the distributions of the explanatory variables and generalize multiple regression analysis techniques to cases with a categorical dependent variable and categorical or numerical predictor variables. Hence, regardless of study design, the logistic regression model is a direct probability model that is able to provide valid estimates (Harrell 2015). The most commonly studied variant is the one in which the response variable is binary, whether nominal or ordinal, as evidenced by Hosmer & Lemeshow (2000), Agresti (2013) and Monroy, Morales & Dávila (2018). Furthermore, LLinás (2006) discussed certain theoretical details of these models.
Applied sciences (such as biomedical and social sciences) often deal with nominal response variables with several levels, together with a vector of explanatory variables in which some components may be measured on an interval scale while others are measured on a nominal scale. In this case, and in other cases in which the responses are not ordinal and the levels have no sequential order, multinomial logistic regression may assist in assessing the relationship between the nominal response and the set of explanatory variables; moreover, it is computationally accessible (Chan 2005, Long 1987). The multinomial logit model is a generalized linear model that can be considered a direct extension of the binary logit model, since its categories can be collapsed to reduce it to a binary model. As per Begg & Gray (1984), this reduced model is able to properly estimate both the logistic parameters and their corresponding standard errors. In fact, it is a special case of the discrete choice models (or conditional logit models) introduced by McFadden (1973), who generalized binary logistic regression by allowing more than two discrete responses.
Based on our literature review, several studies have been published on multinomial logistic regression models. This technique has become widespread, and its use has been critical in the social sciences, marketing applications, and demographic and educational research (Chuang 1997, Peng, Lee & Ingersoll 2002, Pohlman & Leitner 2003). For example, in classical works, such as Hosmer & Lemeshow (2000), Díaz & Morales (2009), Kleinbaum & Klein (2010) and Agresti (2013), these models are viewed as alternative solutions to data analysis problems. Among their multiple applications, Anderson, Verkuilen & Peyton (2010) used a multinomial logistic model to study the psychometric validity of a multiple response instrument. Similarly, as evidenced in Tan, Christiansen, Christensen, Kruse & Bathum (2004), Chen & Kao (2006) and McNevin, Santos, Gamez-Tato, Álvarez-Dios, Casares, Daniel, Phillips & Lareu (2013), these models have been very useful in genetics. Multinomial logistic regression models have also been used to assess the reliability of cancer diagnoses (Lloyd & Frommer 2008). Furthermore, in epidemiology, this model has been used in an explanatory study of factors affecting malaria treatment in African pregnant women (Exavery, Mbaruku, Mbuyita, Makemba, Kinyonge & Kweka 2014). In economics, the model was implemented as part of a study that assessed occupational patterns and trends in The Netherlands (Dessens, Jansen, Ganzeboom & Van der Heijden 2003). In demography, Kim (2015) and Schnor, Vanassche & Bavel (2017) used three-level multinomial logistic regression models to determine the risks associated with the variables considered in each of their studies.
Finally, Monyai, Lesaoana, Darikwa & Nyamugure (2016) applied multinomial logistic regression to educational factors derived from the 2009 General Household Survey in South Africa, while Ekström, Esseen, Westerlund, Grafström, Jonsson & Ståhl (2018) applied logistic regression models to data collected from environmental monitoring programs.
Nevertheless, certain studies, such as Fahrmeir & Kaufmann (1985), McCullagh & Nelder (2018), and Agresti (2013), do not provide a detailed development of a general asymptotic theory of maximum likelihood (ML) estimation for independent but not identically distributed variables.
Moreover, classical Mathematical Statistics books, such as Zacks (1971) and Rao et al. (1973), only cover independent and identically distributed variables, a setting that does not apply to generalized linear models. Other studies, such as Wedderburn (1974), Wedderburn (1976), or McCullagh (1983), discuss the more general concept of quasi-likelihood functions, which are important for logistic models with repeated measurements. Therefore, because of the critical role that multinomial logistic models play in several applications, this study seeks to develop a theory for independent but not identically distributed variables, i.e., a theory that is outlined in the literature but not discussed in detail. For such variables, the theoretical details must be generalized to multinomial models in which the response variable takes any of R ≥ 2 levels, which is the primary contribution of this work. Multinomial models in which the response variable may take one of three levels are addressed in LLinás & Carreño (2012) and LLinás, Arteta & Tilano (2016).
This paper is organized as follows. Section 2 presents the multinomial model and, in Section 3, we briefly introduce the saturated model. Then, in Section 4, we discuss the results on the score vector and the information matrix for the saturated model. Subsequently, in Section 5, we develop the theory corresponding to the multinomial logistic model. Section 6 provides the results on the score vector and the information matrix for this logistic model. Then, in Section 7, we present and prove a theorem on the existence of the logistic parameters, and Section 8 addresses an application of the previously introduced theory. The paper ends with some conclusions in Section 9.

Basic Model
Assume that the variable of interest $Y$ can take $R$ levels $0, 1, \ldots, R-1$. For each $r = 0, 1, \ldots, R-1$, we use the notation $p_r := P(Y = r)$ to represent the probability of $Y$ taking the value $r$. Making independent observations of $Y$, we obtain the sample $Y = (Y_1, Y_2, \ldots, Y_n)^T$ with data $y = (y_1, y_2, \ldots, y_n)^T$, where $y_i \in \{0, 1, \ldots, R-1\}$, $i = 1, \ldots, n$, is a possible value of the sampling variable $Y_i$. Note that the variables $Y_i$ are independent of each other. To build the likelihood function, $R$ indicator variables with values in $\{0, 1\}$ are created for each observation, as follows: $U_{ri} = 1$ if $Y_i = r$ and $U_{ri} = 0$ otherwise, where $r = 0, 1, \ldots, R-1$ and $i = 1, \ldots, n$. Note that $U_{ri}$ has a Bernoulli distribution with parameter $p_{ri} = P(Y_i = r)$. In terms of the $U_{ri}$, the sampling variables are $U_i = (U_{0i}, U_{1i}, \ldots, U_{(R-1)i})$, with values $u_i = (u_{0i}, u_{1i}, \ldots, u_{(R-1)i})$, where $\sum_{r=0}^{R-1} u_{ri} = 1$ for each fixed $i$. Therefore, we obtain the following:

$$P(U_i = u_i) = \prod_{r=0}^{R-1} p_{ri}^{u_{ri}}, \qquad i = 1, \ldots, n.$$

Fixing $y$, we get the likelihood function $L$ for the parameter $p = (p_0, p_1, \ldots, p_{R-1})^T$, with $p_r := (p_{r1}, p_{r2}, \ldots, p_{rn})^T$, and, using this, the logarithm of the likelihood function is obtained as follows:

$$\mathcal{L}(p) = \sum_{i=1}^{n} \sum_{r=0}^{R-1} u_{ri} \ln p_{ri}, \qquad p_{(R-1)i} = 1 - \sum_{r=0}^{R-2} p_{ri}. \tag{1}$$
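As a minimal sketch of the construction above, the following Python snippet codes each observation as its indicator vector and evaluates the logarithm of the likelihood. The helper names `indicators` and `log_likelihood` and the toy data are assumptions made here for illustration; they do not come from the paper.

```python
import math

def indicators(y, R):
    """Return u[i][r] = 1 if y[i] == r and 0 otherwise (hypothetical helper)."""
    return [[1 if y_i == r else 0 for r in range(R)] for y_i in y]

def log_likelihood(u, p):
    """Sum over i and r of u[i][r] * ln p[i][r]; terms with u[i][r] = 0 are skipped."""
    return sum(math.log(p_ir)
               for u_i, p_i in zip(u, p)
               for u_ir, p_ir in zip(u_i, p_i)
               if u_ir == 1)

# Toy data, made up for illustration: n = 4 observations, R = 3 levels.
y = [0, 2, 1, 2]
u = indicators(y, R=3)
# The same probability vector is used for every observation only to keep the example short.
p = [[0.2, 0.3, 0.5] for _ in y]
ll = log_likelihood(u, p)
```

Each row of `u` contains exactly one 1, which mirrors the constraint that the indicators of a fixed observation sum to one.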

Saturated Model
The saturated model is characterized by the following assumptions:

1. The basic assumptions for the model are as follows:
(a) There are $K$ explanatory variables $X_1, \ldots, X_K$ (numerical or categorical) with values $x_{1i}, \ldots, x_{Ki}$, for $i = 1, \ldots, n$ (fixed or observed, according to whether the variables are deterministic or random).
(b) There are $J$ populations, $j = 1, \ldots, J$, where $n_j$ is the number of observations $Y_{ij}$ (or of observations $U_{rij}$ in category $r$) in population $j$. Then, $n_1 + \cdots + n_J = n$.

2. For each population $j = 1, \ldots, J$ and each observation $i = 1, \ldots, n_j$ in $j$, it is assumed that the variables $U_{rij}$, conditional on the values of the explanatory variables in population $j$, are independent of each other, with a Bernoulli distribution whose parameter $p_{rj}$ is the conditional probability that $U_{rij} = 1$, which coincides with the conditional expectation of $U_{rij}$. Hereinafter, the conditioning symbol will be suppressed.

This second assumption implies that, for each $r = 0, 1, \ldots, R-1$ and each population $j = 1, \ldots, J$:
(a) All $p_{rij}$, $i = 1, \ldots, n_j$, within each population $j$ are equal. That is, a vector $p$ of dimension $(R-1)J$ is taken as the parameter.
(b) The variables $Z_{rj} := \sum_{i=1}^{n_j} U_{rij}$ are independent among populations, with a binomial distribution with parameters $n_j$ and $p_{rj}$, with $m_{rj} := E(Z_{rj}) = n_j p_{rj}$ and $V_{rj} := V(Z_{rj}) = n_j p_{rj}(1 - p_{rj})$.

The observed values are gathered in the vector $z = (z_0, z_1, \ldots, z_{R-1})^T$, where the components $z_r$ are defined by $z_r := (z_{r1}, z_{r2}, \ldots, z_{rJ})^T$. From the above, the expected vector $m$ and the covariance matrix $V$ of $Z$ are obtained; the elements of this matrix can be seen in item (b) in the proof of Theorem 4.
According to (1), the logarithm of the likelihood function will be:

$$\mathcal{L}(p) = \sum_{j=1}^{J} \sum_{r=0}^{R-1} z_{rj} \ln p_{rj}, \qquad p_{(R-1)j} = 1 - \sum_{r=0}^{R-2} p_{rj}.$$

The following theorem gives the ML estimator of the saturated model parameters and the possible values that the logarithm of the corresponding likelihood function can take when evaluated at a point estimate of the parameter vector.
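To make the grouped notation concrete, here is a small Python sketch that groups observations into populations, computes the counts $z_{rj}$, and evaluates the sample proportions $z_{rj}/n_j$. It relies on the standard fact that sample proportions maximize a multinomial likelihood. Treating each distinct covariate value as one population, the helper names, and the toy data are all assumptions made for illustration.

```python
import math
from collections import Counter

def group_counts(x, y, R):
    """Count z[j][r]: observations in population j (a distinct x value) with y == r."""
    populations = sorted(set(x))
    counts = Counter(zip(x, y))
    z = [[counts[(x_j, r)] for r in range(R)] for x_j in populations]
    return populations, z

def saturated_fit(z):
    """Sample proportions z_rj / n_j and the grouped log-likelihood evaluated at them."""
    p_hat = [[z_rj / sum(z_j) for z_rj in z_j] for z_j in z]
    # Convention 0 * ln(0) = 0: cells with a zero count contribute nothing.
    ll = sum(z_rj * math.log(p_rj)
             for z_j, p_j in zip(z, p_hat)
             for z_rj, p_rj in zip(z_j, p_j) if z_rj > 0)
    return p_hat, ll

# Toy data, made up for illustration: one dichotomous covariate, R = 3 levels.
x = [0, 0, 0, 0, 1, 1, 1, 1, 1, 1]
y = [0, 1, 2, 2, 0, 0, 1, 2, 2, 2]
populations, z = group_counts(x, y, R=3)
p_hat, ll = saturated_fit(z)
```

Only the counts per population enter the log-likelihood, which is why the grouped form depends on the data through $z$ alone.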

Score and Information of the Saturated Model
In this section, we present and prove some asymptotic properties of both the score vector and the information matrix in the saturated model, highlighting the fact that we are using independent but not identically distributed variables, thus providing details that are not yet found in the literature. These results are important for performing comparison tests using the logistic model. The corresponding properties are presented in the following theorem.
Theorem 2. In the saturated model:
(a) The score vector of the sample is a column vector of size $(R-1)J$.
(b) The information matrix of the sample is a square matrix of size $(R-1)J$ whose blocks $A_{rr'}$ are diagonal matrices of size $J$.
The remaining items of the theorem are expressed in terms of the diagonal matrices $A_{rr}$ described in item (b).

Proof.
(a) We denote the random score vector of the $i$-th observation by $S_i(p)$. The results are obtained immediately if assumption 1 of Section 3 is taken into account.
(c) Let $j, k = 1, 2, \ldots, J$ and $r, r' = 0, 1, \ldots, R-2$. Three cases are considered: (1) $j \neq k$ and $r \neq r'$; (2) $j = k$ and $r = r'$; (3) $r \neq r'$ and $j = k$. The results found in these three cases prove the item.
The remaining result is immediate.

Remark 1. In this case, $\tilde{I}$ has diagonal principal submatrices with elements $n_j / v_{rj} > 0$ and its determinant is positive, which implies that $\tilde{I}$ is positive definite and therefore nonsingular, so that the root $\tilde{I}^{-1/2}$ exists (see Harville (1997)).
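For orientation, the first and second derivatives of the grouped log-likelihood of the saturated model can be worked out directly. The following is a sketch derived here from assumption 2 of Section 3, using $p_{(R-1)j} = 1 - \sum_{r=0}^{R-2} p_{rj}$; it is not a transcription of the displays of Theorem 2, and the Kronecker deltas $\delta_{jk}$ and $\delta_{rr'}$ are notation introduced only for this sketch.

```latex
\begin{align*}
\mathcal{L}(p) &= \sum_{j=1}^{J}\Bigl[\,\sum_{r=0}^{R-2} Z_{rj}\ln p_{rj}
   + Z_{(R-1)j}\ln\Bigl(1-\sum_{r=0}^{R-2}p_{rj}\Bigr)\Bigr],\\[4pt]
\frac{\partial \mathcal{L}}{\partial p_{rj}}
  &= \frac{Z_{rj}}{p_{rj}} - \frac{Z_{(R-1)j}}{p_{(R-1)j}},
  \qquad r = 0,\ldots,R-2,\\[4pt]
-E\Bigl[\frac{\partial^{2} \mathcal{L}}{\partial p_{rj}\,\partial p_{r'k}}\Bigr]
  &= \delta_{jk}\, n_j\Bigl(\frac{\delta_{rr'}}{p_{rj}} + \frac{1}{p_{(R-1)j}}\Bigr),
  \qquad r, r' = 0,\ldots,R-2 .
\end{align*}
```

Under this sketch, every block indexed by $(r, r')$ is diagonal in $j$, in line with the diagonal blocks of size $J$ mentioned in Theorem 2, and for $R = 2$ the diagonal entries reduce to $n_j / [p_{0j}(1 - p_{0j})]$.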
In the following theorem, asymptotic results are presented and demonstrated both for the score vector and the information matrix in the saturated model.
Here, $\overset{a}{=}$ denotes asymptotic equivalence.

Proof.
(a) As per item (a) of Theorem 2, consider the diagonal matrix whose elements are the diagonal elements of $\tilde{I}$. Then, the elements of the column vector $\frac{1}{n}(S(p) - S^{*}(p))$ are examined for fixed $r$, $r'$ and $j$. Now, it is known that

$$\frac{Z_{rj}}{n_j} - p_{rj} \xrightarrow{\;p\;} 0 \quad \text{if } n \to \infty,$$

by the weak law of large numbers.
(c) The variables $Z_{rj}$ converge to the normal distribution by the multivariate central limit theorem, for a fixed value of $J$.
(d) The result follows by considering item (d) of Theorem 2.

Remark 2. The assumption of Theorem 3 can be interpreted as follows: the "speed" at which each $n_j \to \infty$ must be the same as that of $n \to \infty$. For example, in a balanced design, all the $n_j$ are equal. In this case, $n_j = n/J$. Therefore, the quantity $\frac{1}{n} \cdot \frac{n_j}{v_j} = \frac{1}{J v_j}$ is fixed; that is, it does not depend on $n$. Using the notation of the previous theorem, $\sigma_j^2 = J v_j$ because, in this case, expression (4) becomes the equality $\frac{1}{n}\tilde{I} = \Xi$.
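The arithmetic in Remark 2 can be checked with a few lines of Python. The values of $J$ and $v_j$ below are hypothetical, chosen only to show that the scaled quantity does not change with $n$ in a balanced design.

```python
def scaled_information_element(n, J, v_j):
    """Return (1/n) * (n_j / v_j) for a balanced design with n_j = n / J."""
    n_j = n / J
    return (n_j / v_j) / n

# Hypothetical values: J = 4 populations and v_j = 0.21.
J, v_j = 4, 0.21
values = [scaled_information_element(n, J, v_j) for n in (40, 400, 4000)]
limit = 1 / (J * v_j)
```

All entries of `values` agree with `limit` up to rounding, whatever the sample size.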

Multinomial Logistic Model
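In addition to assumptions 1 and 2 of Section 3, it is assumed that the design matrix

$$C = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1K} \\ 1 & x_{21} & \cdots & x_{2K} \\ \vdots & \vdots & & \vdots \\ 1 & x_{J1} & \cdots & x_{JK} \end{pmatrix}$$

has full rank with $\mathrm{Rg}(C) = 1 + K \le J$. To define the logistic model, we take one of the categories of the dependent variable $Y$ as the reference, say $R-1$, with the following additional assumption:

$$g_r(x_j) := \ln\left(\frac{p_{rj}}{p_{(R-1)j}}\right) = \delta_r + \beta_{r1} x_{j1} + \cdots + \beta_{rK} x_{jK} = x_j^T \beta_r, \qquad r = 0, 1, \ldots, R-2,$$

where $x_j := (1, x_{j1}, \ldots, x_{jK})^T$. With $\beta_r := (\delta_r, \beta_{r1}, \ldots, \beta_{rK})^T$, the parameter vector $\alpha = (\beta_0^T, \beta_1^T, \ldots, \beta_{R-2}^T)^T$ is a column vector of size $(R-1)(1+K)$. Note that the assumption $\mathrm{Rg}(C) = 1 + K$ is important for the parameter $\alpha$ to be identifiable.
For an observation $x_j$ in population $j$ and for each $r$,

$$p_{rj} = \frac{\exp(g_r(x_j))}{\sum_{s=0}^{R-1} \exp(g_s(x_j))},$$

where $g_{R-1}(x_j) = 0$. The logarithm of the likelihood function can be written as a function of $\alpha$ as follows:

$$\mathcal{L}(\alpha) = \sum_{j=1}^{J} \left[ \sum_{r=0}^{R-2} Z_{rj}\, g_r(x_j) - n_j \ln\left( \sum_{s=0}^{R-1} \exp(g_s(x_j)) \right) \right].$$

The likelihood equations are found by calculating the first derivatives of $\mathcal{L}(\alpha)$ with respect to each of the $(R-1)(1+K)$ unknown parameters. Therefore, for each fixed $k = 0, 1, \ldots, K$ and each $r = 0, 1, \ldots, R-2$, the likelihood equations are given by

$$\frac{\partial \mathcal{L}(\alpha)}{\partial \beta_{rk}} = \sum_{j=1}^{J} x_{jk} (Z_{rj} - n_j p_{rj}), \tag{7}$$

with $x_{j0} := 1$ and $\beta_{r0} := \delta_r$. The maximum likelihood estimator is obtained by setting these expressions equal to zero and solving for the logistic parameters. The solution requires the same type of iterations that were used to obtain the estimates in the binary case and in the case of three levels, as demonstrated in LLinás (2006).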
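The probabilities $p_{rj}$ and the right-hand side of the likelihood equations (7) can be sketched in Python as follows. The function names, the layout of `C` and `z` as nested lists, and the toy counts are assumptions made for illustration only.

```python
import math

def probabilities(x_j, betas):
    """p_rj for r = 0..R-1, with reference category R-1 (g_{R-1} = 0)."""
    g = [sum(b * x for b, x in zip(beta_r, x_j)) for beta_r in betas] + [0.0]
    denom = sum(math.exp(g_r) for g_r in g)
    return [math.exp(g_r) / denom for g_r in g]

def score(C, z, betas):
    """Entries sum_j x_jk (z_rj - n_j p_rj) for r = 0..R-2 and k = 0..K."""
    s = [[0.0] * len(C[0]) for _ in betas]
    for x_j, z_j in zip(C, z):
        n_j = sum(z_j)
        p_j = probabilities(x_j, betas)
        for r in range(len(betas)):
            for k, x_jk in enumerate(x_j):
                s[r][k] += x_jk * (z_j[r] - n_j * p_j[r])
    return s

# Toy grouped data, made up for illustration: J = 2, K = 1, R = 3.
C = [[1.0, 0.0], [1.0, 1.0]]        # rows x_j = (1, x_j1)
z = [[10, 20, 30], [25, 15, 20]]    # z[j][r] for r = 0, 1, 2
betas = [[0.0, 0.0], [0.0, 0.0]]    # beta_0 and beta_1; all zero gives p_rj = 1/3
s = score(C, z, betas)
```

With all coefficients equal to zero, every category has probability 1/3, so each score entry is a weighted sum of the differences between the observed counts and $n_j / 3$.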

Logistic Model Information and Score
The following theorem shows some properties of the score vector and the information matrix in a logistic model.

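(a) Taking into account that $\mathrm{Rg}(C) = 1 + K$, the matrix $I_{R-1} \otimes C$ has rank $(R-1)(1+K)$. Here, $\otimes$ is the Kronecker product and $I_{R-1}$ is the identity matrix of order $R-1$. This indicates that $\Im$ has full rank. Now, we must prove that $\Im$ is positive definite, that is, $u^T \Im u > 0$ for all $u \neq 0$. With $u \neq 0$ being any column vector of size $(R-1)(1+K)$, we have

$$u^T \Im u = u^T (I_{R-1} \otimes C)^T V (I_{R-1} \otimes C)\, u = \left\| V^{1/2} (I_{R-1} \otimes C)\, u \right\|^2 .$$

But $V^{1/2} (I_{R-1} \otimes C)\, u$ is a column vector of size $(R-1)J$. Therefore, for all $u \neq 0$, it is true that $u^T \Im u \geq 0$. However, $u^T \Im u = 0$ if and only if $V^{1/2} (I_{R-1} \otimes C)\, u = 0$. Now, regarding the components of $V$: for $r \neq r'$, the components $-n_j p_{rj} p_{r'j}$ are negative.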
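As a numerical illustration of the positive definiteness argument, the sketch below assembles the information matrix block by block and evaluates the quadratic form for a few vectors. It assumes the standard multinomial covariances $n_j p_{rj}(1 - p_{rj})$ and $-n_j p_{rj} p_{r'j}$, so that block $(r, r')$ equals $\sum_j x_j x_j^T \operatorname{Cov}(Z_{rj}, Z_{r'j})$. The names, the coefficient values, and the test vectors are hypothetical.

```python
import math

def probabilities(x_j, betas):
    """p_rj for r = 0..R-1, with reference category R-1 (g_{R-1} = 0)."""
    g = [sum(b * x for b, x in zip(beta_r, x_j)) for beta_r in betas] + [0.0]
    denom = sum(math.exp(g_r) for g_r in g)
    return [math.exp(g_r) / denom for g_r in g]

def information(C, n, betas):
    """Matrix whose block (r, t) is sum_j x_j x_j^T Cov(Z_rj, Z_tj), r, t = 0..R-2."""
    q, d = len(betas), len(C[0])
    info = [[0.0] * (q * d) for _ in range(q * d)]
    for x_j, n_j in zip(C, n):
        p = probabilities(x_j, betas)
        for r in range(q):
            for t in range(q):
                cov = n_j * p[r] * ((1.0 if r == t else 0.0) - p[t])
                for k in range(d):
                    for l in range(d):
                        info[r * d + k][t * d + l] += x_j[k] * x_j[l] * cov
    return info

def quadratic_form(A, u):
    """Return u^T A u for a square matrix A given as nested lists."""
    m = len(u)
    return sum(u[a] * A[a][b] * u[b] for a in range(m) for b in range(m))

# Hypothetical setting: J = 2, K = 1, R = 3, full-rank design matrix.
C = [[1.0, 0.0], [1.0, 1.0]]
n = [60, 60]
betas = [[0.2, -0.5], [-0.1, 0.3]]
info = information(C, n, betas)
test_vectors = [[1, 0, 0, 0], [0, 1, 0, 0], [1, -1, 1, -1], [0.5, 2, -3, 1]]
forms = [quadratic_form(info, u) for u in test_vectors]
```

A handful of positive quadratic forms does not prove positive definiteness; the snippet only illustrates the statement for one full-rank design.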
For any vector of real numbers, we want to prove that the associated linear combination of the score vector has an asymptotic one-dimensional normal distribution, where $S_i(\alpha)$ is the score vector of observation $i$. It is clear that $V = V^{*} \Im V^{*}$, where $V^{*}$ is as per Theorem 2(d). The next step relies on Theorem 3.3.1b of Tilano & Arteta (2012). Considering the above and knowing that $\Xi$ is positive definite, and since Lindeberg's condition also holds, the result follows by applying the multivariate central limit theorem.

Remark 4. For the case of non-grouped data, where $J = n$, the assumption given in (b) makes no sense because $J$ is not fixed. In that case, it is assumed directly that $\frac{1}{J} \Im$ has a positive definite limit $\Xi$.

Existence and Calculations of the Logistic Parameters
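Theorem 6 (Existence theorem). The ML estimates $\hat{\alpha}$ of $\alpha$ exist, are unique, and are calculated according to the following recursion formula:

$$\hat{\alpha}^{(t+1)} = \hat{\alpha}^{(t)} + \left[ (I_{R-1} \otimes C)^T \hat{V} (I_{R-1} \otimes C) \right]^{-1} (I_{R-1} \otimes C)^T (Z - \hat{m}),$$

where $\hat{V}$ and $\hat{m}$ are the estimated covariance matrix and the estimated expected vector of $Z$, respectively (that is, the estimates of the $V$ and $m$ defined in Section 3, assumption 2), evaluated at $\hat{\alpha}^{(t)}$. In addition, asymptotically we have

$$\hat{\alpha} \overset{a}{=} \alpha + \Im^{-1} (I_{R-1} \otimes C)^T (Z - m).$$

Proof. By Theorem 4(c), $\frac{\partial^2 \mathcal{L}}{\partial \alpha^2} = -\Im$ is a matrix of full rank $(R-1)(K+1)$. Therefore, the ML estimates $\hat{\alpha}$ are the only solutions of the $(R-1)(K+1)$ equations $\frac{\partial \mathcal{L}(\alpha)}{\partial \alpha} = 0$ or, equivalently, by Theorem 4(a), of $(I_{R-1} \otimes C)^T (Z - m) = 0$. Using the Taylor approximation, if $\alpha_1$ is a point between $\alpha$ and $\hat{\alpha}$, then

$$0 = \frac{\partial \mathcal{L}(\hat{\alpha})}{\partial \alpha} = \frac{\partial \mathcal{L}(\alpha)}{\partial \alpha} + \frac{\partial^2 \mathcal{L}(\alpha_1)}{\partial \alpha^2} (\hat{\alpha} - \alpha).$$

Considering Equation (9) and Theorem 4, this expression can be rewritten as

$$\hat{\alpha} = \alpha + \Im(\alpha_1)^{-1} (I_{R-1} \otimes C)^T (Z - m).$$

As $\alpha_1$ is a point on the line segment that joins $\alpha$ and $\hat{\alpha}$, $\alpha_1 = t\alpha + (1-t)\hat{\alpha}$ for some $t \in [0, 1]$. Under the assumption that $\hat{\alpha}$ is strongly consistent for $\alpha$, that is, $\hat{\alpha} \to \alpha$ almost surely, it follows that $\alpha_1 \to \alpha$ almost surely as well.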
Replacing $\alpha$ on the right side by the $t$-th approximation $\hat{\alpha}^{(t)}$ of $\hat{\alpha}$, we obtain the recursion formula that provides the $(t+1)$-th approximation $\hat{\alpha}^{(t+1)}$ of $\hat{\alpha}$.

Revista Colombiana de Estadística - Theoretical Statistics 43 (2020) 211-231

Example
Fontanella, Early & Phillips (2008) present results from a study that examines the influence of both clinical and non-clinical factors on level of aftercare decisions. The corresponding data set (named APS data) was modified to protect confidentiality. It is available in the aplore3 package in R. For the application, we selected only two variables, which are described in Table 1. To apply the multinomial logistic model, we used Day treatment or Outpatient (2) as the reference outcome value. These data were entered in the R statistical package and grouped into populations. First, Table 2 shows the cross-classification of PLACE3 versus history of violence (VIOL). We observe that the dependent variable Y takes one of R = 3 possible values and the explanatory variable X is dichotomous. We found that J = 2, with n_1 = 121 (group for X = 0) and n_2 = 387 (group for X = 1). As the estimator of the covariance matrix of the maximum likelihood estimator is the inverse of the observed information matrix, the estimated covariance matrix for the fitted model is obtained by inverting that matrix. When we apply the package nnet in R, we get the same results. The results of fitting the three-category logistic regression model to these data, using the multinom function, are presented in Table 3. This table shows the estimated coefficients, their estimated standard errors, the estimated odds ratios (OR), and the 95% confidence intervals for the odds ratios for PLACE3 = 0 versus PLACE3 = 2 and for PLACE3 = 1 versus PLACE3 = 2.
From Table 3, we see that the VIOL variable is statistically significantly associated with adolescent placement. Table 4 shows the results obtained when performing the comparison test of the null model with the logistic model. From this table, we obtain a test statistic that yields a p-value of 0.0002. In conclusion, having a history of violence is a significant factor for being placed in some type of residential facility.
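The recursion of Theorem 6 can be illustrated with a small, self-contained Newton-Raphson sketch in Python for grouped data. This is not the code used for the application, which relied on multinom from nnet, and the counts below are made up (the cell counts of Table 2 are not reproduced here). The function names, starting values, tolerance, and the plain Gaussian elimination solver are all assumptions of this sketch.

```python
import math

def probabilities(x_j, betas):
    """p_rj for r = 0..R-1, with reference category R-1 (g_{R-1} = 0)."""
    g = [sum(b * x for b, x in zip(beta_r, x_j)) for beta_r in betas] + [0.0]
    denom = sum(math.exp(g_r) for g_r in g)
    return [math.exp(g_r) / denom for g_r in g]

def score_and_information(C, z, betas):
    """Flattened score vector and information matrix at the current betas."""
    q, d = len(betas), len(C[0])
    s = [0.0] * (q * d)
    info = [[0.0] * (q * d) for _ in range(q * d)]
    for x_j, z_j in zip(C, z):
        n_j = sum(z_j)
        p = probabilities(x_j, betas)
        for r in range(q):
            for k in range(d):
                s[r * d + k] += x_j[k] * (z_j[r] - n_j * p[r])
                for t in range(q):
                    cov = n_j * p[r] * ((1.0 if r == t else 0.0) - p[t])
                    for l in range(d):
                        info[r * d + k][t * d + l] += x_j[k] * x_j[l] * cov
    return s, info

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    m = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda i: abs(M[i][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for i in range(col + 1, m):
            f = M[i][col] / M[col][col]
            for c in range(col, m + 1):
                M[i][c] -= f * M[col][c]
    x = [0.0] * m
    for i in reversed(range(m)):
        x[i] = (M[i][m] - sum(M[i][c] * x[c] for c in range(i + 1, m))) / M[i][i]
    return x

def newton_raphson(C, z, R, tol=1e-10, max_iter=50):
    """Iterate alpha <- alpha + information^{-1} * score, starting from zero."""
    d = len(C[0])
    betas = [[0.0] * d for _ in range(R - 1)]
    for _ in range(max_iter):
        s, info = score_and_information(C, z, betas)
        step = solve(info, s)
        betas = [[betas[r][k] + step[r * d + k] for k in range(d)]
                 for r in range(R - 1)]
        if max(abs(v) for v in step) < tol:
            break
    return betas

# Toy grouped data, made up for illustration: J = 2, K = 1, R = 3.
C = [[1.0, 0.0], [1.0, 1.0]]
z = [[10, 20, 30], [25, 15, 20]]
betas = newton_raphson(C, z, R=3)
fitted = [probabilities(x_j, betas) for x_j in C]
```

In this toy setting the design has as many columns as populations (1 + K = J = 2), so the fitted probabilities coincide with the sample proportions of each population, which gives a simple check of the iteration.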

Conclusions
Previous studies, such as Zacks (1971), Rao et al. (1973), Fahrmeir & Kaufmann (1985), McCullagh & Nelder (2018), and Agresti (2013), do not provide a detailed development of a general asymptotic theory of ML estimation for independent but not identically distributed variables in generalized linear models. Furthermore, other works, such as Wedderburn (1974), Wedderburn (1976), or McCullagh (1983), limit themselves to discussing the more general concept of quasi-likelihood functions, which are important for logistic models with repeated measurements. Given these gaps in the literature, the theoretical details for independent but not identically distributed variables must be generalized to multinomial models in which the response variable takes any of R ≥ 2 levels. This is the primary contribution of this work. Multinomial models where the response variable may take one of three levels are addressed in LLinás & Carreño (2012) and LLinás et al. (2016). In this study, we extended these last two works to cover the cases where R > 3.
We studied multinomial logistic and saturated models where the response variable takes one of R ≥ 2 values, emphasizing the fact that we used independent but not identically distributed variables, thus providing details that are not yet found in the literature. For this purpose, we proved the properties of the score vectors and the information matrices for these models. Furthermore, based on asymptotic theory, we proved the convergence theorems for the score vectors and information matrices of both models and emphasized that the score vectors are asymptotically multivariate normal. Moreover, we proved a theorem about the existence and calculation of the ML estimates of the parameters of the multinomial saturated model. Similarly, we presented and proved a theorem on the existence of the ML estimates of the logistic parameters, briefly explaining the iteration method used for their calculation, i.e., the Newton-Raphson method.
Based on the results from this work, we will be able to compare the logistic model against its corresponding saturated model, which will allow us to reduce the number of observations and perform faster computer-based assessments. For future studies, the results yielded by this study may be used to construct test statistics, as well as their corresponding asymptotic distributions.
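• If $\sum_{r=0}^{R-2} z_{rj} = n_j$, then $\left. \frac{\partial^2 \mathcal{L}}{\partial p_{rj}^2} \right|_{p_{rj} = \tilde{p}_{rj}} = n_j p_{rj} > 0$. In this case, $\mathcal{L}$ increases in $\tilde{p}_{rj}$.
That is, $\mathcal{L}(p)$ assumes a maximum when $\tilde{p}_{rj} = 1$.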