1. Generalized Linear Model

One of the most important tasks for any researcher doing regression analysis is to choose the best possible model for the data set at hand. How well the model works in practice depends on our constraints and on how closely our assumptions coincide with the assumptions of the model itself.

The Gaussian models discussed in Section 3.1 come with requirements of their own: the outcome variable must be continuous and normally distributed. For a continuous outcome that is far from any theoretical maximum or minimum, such Gaussian models have maximum entropy. In social science research, however, it is very rarely the case that we have a continuous outcome without any boundaries; in most cases the outcomes are dichotomous, nominal, or ordinal. If, instead of a continuous outcome, we have a variable that is either discrete or bounded, a Gaussian likelihood is no longer the most powerful choice, and no longer the choice with the highest possible entropy. For example, if the outcome is a count, such as the number of blue balls pulled from a bag (instead of the probability of drawing a blue ball from the bag, as in the previous example), then the outcome is constrained to be zero or a positive integer, because it is a count of real objects. A Gaussian model for such a variable would perform poorly and would be inappropriate for several reasons. First, while we implicitly assume that counts cannot be negative, a linear regression model does not account for that: such a model will predict negative values whenever the mean count is close to 0. Second, we will likely run into heteroscedasticity and non-normality of the errors, which arise when the outcome is not continuous; most statistical tests on the parameters of a linear regression model are then unjustified, meaning we can no longer trust their results, because some of the basic linear model assumptions are violated. Third, the functional form specified by the linear model will always be incorrect. If we want a trustworthy model, we need a functional form that faithfully accounts for all the specific limitations on our variables, especially on the outcome variable.

Therefore, a researcher faces a typical problem here: if the intuition behind the problem adds logical or theoretical assumptions that cannot be accommodated by the model chosen to solve it, then the results provided by the model may be biased. Thus, for discrete or bounded outcome variables we need to find a better way of solving the problem. To do so, we place the appropriate constraints on the outcome (for example, non-negativity) and then use maximum entropy to choose the distribution. In that case, all we have to do is generalize the linear regression strategy to probability distributions other than the Gaussian. In other words, we need to extend the spectrum of probability distributions that can be used, and this is the essence of the GLM. Such models look as follows:

y_i ~ Binomial(n, p_i)
f(p_i) = α + βx_i

Generally, there are only two changes here from the Gaussian model considered in Section 3.1. First, for a bounded outcome variable the likelihood should be binomial rather than Gaussian: for a count outcome y in which each observation arises from n trials with constant expected value np, the binomial distribution has maximum entropy, just as it does when there are only two possible outcomes. So the binomial distribution is the least informative distribution that satisfies our prior knowledge of the outcomes y. However, if the outcome variable had different constraints, the maximum entropy distribution could be different as well. Second, a GLM needs a link function f(p_i), which must be determined separately from the choice of distribution.
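As an illustration of the model written above, the sketch below fits a binomial GLM with a logit link by maximum likelihood on simulated data. Everything here is an assumption made for the demonstration – the simulated coefficients, the number of trials per observation, and the choice of optimizer – not something taken from the text.

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Simulated data (assumed for the demo): x is a predictor,
# y counts successes out of n_trials per observation.
n_trials = 10
x = rng.normal(size=200)
true_alpha, true_beta = -0.5, 1.2
p = 1.0 / (1.0 + np.exp(-(true_alpha + true_beta * x)))
y = rng.binomial(n_trials, p)

def neg_log_lik(theta):
    """Negative binomial log-likelihood with a logit link: f(p) = alpha + beta*x."""
    alpha, beta = theta
    eta = np.clip(alpha + beta * x, -30, 30)   # clip for numerical stability
    p_hat = 1.0 / (1.0 + np.exp(-eta))
    # Constant binomial coefficients are dropped; they do not affect the MLE.
    return -np.sum(y * np.log(p_hat) + (n_trials - y) * np.log(1.0 - p_hat))

fit = minimize(neg_log_lik, x0=[0.0, 0.0], method="Nelder-Mead")
alpha_hat, beta_hat = fit.x   # should land near the simulated -0.5 and 1.2
```

The same fit could be obtained with a dedicated GLM routine; writing out the likelihood makes explicit that the only new ingredients relative to ordinary regression are the binomial likelihood and the link.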

1.1 Exponential family of distributions and the principle of maximum entropy

The exponential distribution is a fundamental distribution of distance and duration – the kinds of measurement that represent displacement from some point of reference, either in time or in space. Logically, this distribution is constrained to be zero or positive. If the probability of an event is constant in time or across space, then the distribution of events tends towards the exponential. The exponential distribution has maximum entropy among all non-negative continuous distributions with the same average displacement.

The gamma distribution is also constrained to be zero or positive, and it too is a fundamental distribution of distance and duration. The main difference between the gamma distribution and the exponential distribution is that the former can have a peak above zero. If an event can only happen after two or more other events have happened before it, then the waiting time until that event will be gamma distributed. The gamma distribution has maximum entropy among all distributions with the same mean and the same average logarithm.

The Poisson distribution is a count distribution. Mathematically, it is a special case of the binomial distribution: if the number of trials n is very large (and usually unknown) and the probability of success p is very small, then the binomial distribution converges to a Poisson distribution with an expected rate of events per unit time of λ = np. In practice, the Poisson distribution is used for counts that never get close to any theoretical maximum. As a special case of the binomial, it has maximum entropy under exactly the same constraints.
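The binomial-to-Poisson limit is easy to verify numerically. In the sketch below, the particular values n = 10000 and p = 0.0003 are arbitrary choices for the demonstration.

```python
import math

n, p = 10_000, 0.0003      # many trials, tiny success probability (demo values)
lam = n * p                # Poisson rate: lambda = np = 3.0

def binom_pmf(k):
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(k):
    return math.exp(-lam) * lam**k / math.factorial(k)

# For small counts the two pmfs agree to roughly four decimal places.
for k in range(6):
    assert abs(binom_pmf(k) - poisson_pmf(k)) < 1e-4
```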

Generally, a researcher could reason about each of these distributions and the restrictions it accounts for, and then make a choice in favor of one or another, but it is almost impossible to think through all of them, or at least it would take too much time. Luckily, there is an easier and more parsimonious way to choose the proper likelihood function for a GLM – the principle of maximum entropy. It provides an empirically successful way to search through all of these distributions, because it accounts for all the specific restrictions we have mentioned above and then automatically selects the appropriate distribution. Information entropy is essentially a measure of the number of ways a distribution can arise, given the stated assumptions.
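Entropy as a "count of ways" can be made concrete with a toy computation of H(p) = −Σ p_i log p_i. The three distributions below, chosen arbitrarily for illustration, all live on {0, 1, 2} and share a mean of 1; the flattest one has the highest entropy.

```python
import math

def entropy(probs):
    """Shannon entropy H(p) = -sum(p * log p), skipping zero-probability outcomes."""
    return -sum(q * math.log(q) for q in probs if q > 0)

# Three distributions on {0, 1, 2}, all with mean 1 (values assumed for the demo).
spiky   = (0.5, 0.0, 0.5)
uniform = (1/3, 1/3, 1/3)
peaked  = (0.25, 0.5, 0.25)

# The more evenly the probability is spread, the more ways the distribution
# can arise, and the higher its entropy.
assert entropy(uniform) > entropy(peaked) > entropy(spiky)
```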

One more advantage of the principle of maximum entropy is that we can use it to choose priors. This approach is particularly effective for GLMs, which work well with conventional weakly informative priors: maximum entropy provides a way to generate a prior that accounts for all the background information while assuming as little as possible. This helps the researcher make defensible choices and avoid bias.

1.2 Link functions and their role in the GLM

In this section, we pay closer attention to the problem of outcome restrictions and to why we use GLMs in situations where we must account for such strict constraints.

Usually, in real-life situations we do not observe a mean parameter μ describing the average outcome, and even when we do, it is rarely unbounded in both directions. For example, the shape of the binomial distribution is determined, like the Gaussian, by two parameters. But unlike the Gaussian, neither of these parameters is the mean: the mean outcome is np, a function of both parameters. Since n is usually known, the standard approach is to attach a linear model to the unknown part, p – a probability mass. Thus p_i must lie between 0 and 1. The problem is that the linear model itself can freely fall below zero or exceed one, and may therefore predict a linear trend that goes beyond our restrictions, so that the shape of the model will be wrong.

Thus, it is a common problem for a researcher to encounter such absurdities as negative distances or probability masses that exceed 1. To prevent such mathematical accidents, we use a link function. The main goal of such a function is to map the linear space of a model like α + βx_i onto the non-linear space of a parameter. For most GLMs we use one of the two most common link functions: a logit link for mapping probabilities (to constrain them to lie between 0 and 1) and a log link for mapping positive values (to avoid negative distances).

The logit link maps a parameter that is defined as a probability mass, p, onto a linear model that can take on any real value. This link is extremely common when working with binomial GLMs. When we use a logit link for a parameter, we simultaneously define the parameter's value to be the logistic transform of the linear model. In other words, we transform the linear space, which measures unit changes and is continuous in both positive and negative directions, into a space constrained entirely between 0 and 1. This transformation does affect the interpretation of parameter estimates, because a unit change in a predictor variable no longer produces a constant change in the mean of the outcome variable. Instead, a unit change in x_i may produce a larger or smaller change in the probability p_i, depending on how far from zero the log-odds are. Implementing this method changes the interpretation, but overall it helps us solve the problem of bias: the probability is now properly restricted in the linear model, so that p lies between 0 and 1.
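This varying marginal effect is easy to see numerically. In the sketch below, the coefficients α = 0 and β = 1 and the chosen x values are arbitrary assumptions made for the illustration.

```python
import math

def logistic(eta):
    """Inverse of the logit link: maps any real value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-eta))

alpha, beta = 0.0, 1.0     # assumed coefficients for the demo

# The same one-unit change in x moves the probability by different amounts:
near_middle = logistic(alpha + beta * 1) - logistic(alpha + beta * 0)
far_out     = logistic(alpha + beta * 6) - logistic(alpha + beta * 5)

# Near p = 0.5 the shift is about 0.23; far in the tail it is under 0.005.
assert near_middle > 0.2
assert far_out < 0.005
```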

The second very common link function is the log link. This function maps a parameter that is defined over only positive real values onto a linear model. Suppose we want to model a parameter that is restricted to be a positive real number; a log link is very useful here because it defines the parameter's value to be the exponentiation of the linear model. A log link in a linear model therefore implies exponential scaling of the outcome with the predictor variable. Since the value of the exponential function is restricted to be positive, the positivity problem is solved. However, with the log link we can simultaneously create another problem, which appears when the model is asked to predict values outside the range of the data used to fit it. In other words, we might run into trouble with forecasts based on the model we have built.
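A minimal sketch of the log link, with coefficients chosen arbitrarily for the demonstration: setting log(μ) = α + βx gives μ = exp(α + βx), which is always positive, and each unit change in x multiplies the mean by the constant factor exp(β).

```python
import math

alpha, beta = 0.1, 0.7     # assumed coefficients for the demo

def mu(x):
    """Mean implied by a log link: log(mu) = alpha + beta * x."""
    return math.exp(alpha + beta * x)

# The log link guarantees positivity even for very negative linear predictors...
assert mu(-100) > 0

# ...and implies exponential scaling: a unit change in x multiplies mu by exp(beta).
assert abs(mu(3) / mu(2) - math.exp(beta)) < 1e-12
```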

To sum up this section, using the most common logit and log link functions for linear models does solve the problem of constraining a probability to the range from 0 to 1 and a parameter to be positive. However, link functions can sometimes distort inference. If a researcher wants to guard against such distortion, it is possible to perform a sensitivity analysis, which explores how changes in assumptions influence inference. If none of the alternative assumptions have much impact on inference, then we can keep the link function that has been chosen. The necessity of choosing a link function to bind the linear model to the generalized outcome solves the problem of enforcing constraints on restricted values or probabilities; however, it simultaneously introduces new complexities in model specification, estimation, and interpretation.

2. Conclusion

Entropy maximization in probability theory is largely about counting, and such counting allows us to solve the problem of choosing among distributions. Of course, there is no guarantee that it gives us the best probability distribution for the real empirical problem we analyze. But there is a guarantee that no other distribution more conservatively reflects our assumptions, because any other distribution would imply something beyond our assumptions, something unknown. A full and truthful accounting of assumptions is crucial, because it gives us an understanding of how a model misbehaves. And since all models misbehave sometimes, it is good to be on the safe side and anticipate those misbehaviors before they happen.

To sum up, the principle of maximum entropy provides an empirically successful way to choose likelihood functions, and we can apply this method to models of essentially any complexity. In the context of Generalized Linear Models, maximum entropy helps us account for the constraints on our variables in the simplest and most direct way. By betting on the maximum entropy distribution we might not get a perfect result in practice, but at least we will not impose unreasonable constraints that could bias the final result. It is always important to remember that the principle of maximum entropy gives different solutions for different real-life problems: depending on the characteristics of the problem itself, the maximum entropy solution could be a binomial distribution, a Gaussian distribution, or any other. So the main principle here is to think first about the data itself.