
1. Generalized Linear Model

One of the most important tasks for any researcher doing regression analysis is to choose the best possible model for the data set at hand. How well the model works in practice depends on our constraints and on how closely our assumptions match the assumptions of the model itself.

The Gaussian models discussed in Section 3.1 place specific requirements on the data: the outcome variable must be continuous and normally distributed. For an outcome variable that is continuous and far from any theoretical maximum or minimum, a Gaussian model has maximum entropy. In social science research, however, it is rare to have a continuous outcome without any boundaries; in most cases the outcomes are dichotomous, nominal or ordinal. If, instead of a continuous outcome, we have a variable that is either discrete or bounded, a Gaussian likelihood is no longer the most appropriate choice, nor the choice with the highest possible entropy. For example, if the outcome is a count, such as the number of blue balls pulled from a bag (instead of the probability of drawing a blue ball from the bag in the previous example), then that outcome is constrained to be zero or a positive integer, because it counts real objects. A Gaussian model for such a variable would perform poorly and would be inappropriate for several reasons. Firstly, although we assume that counts cannot be negative, a linear regression model does not account for that: such a model will predict negative values whenever the mean count is close to 0. Secondly, we will probably run into heteroscedasticity and non-normality of errors, which arise when the outcome is not continuous. Most statistical tests on the parameters of a linear regression model are then unjustified, which means we can no longer trust their results, because some of the basic linear model assumptions are violated. Thirdly, the functional form specified by the linear model will always be incorrect. If we want a trustworthy model, we need a linear function that truthfully accounts for all the specific limitations on our variables, especially on the outcome variable.
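The first of these problems is easy to demonstrate numerically. The following minimal Python sketch fits an ordinary least-squares line to hypothetical count data whose mean shrinks toward zero; the simulated data and the coefficient values are purely illustrative assumptions, not results from the study.

import numpy as np

rng = np.random.default_rng(1)

# Hypothetical count data whose mean decays toward zero as x grows.
x = np.linspace(0, 10, 100)
y = rng.poisson(np.exp(1.5 - 0.4 * x))   # counts, never negative

# Ordinary linear (Gaussian) regression fit by least squares.
slope, intercept = np.polyfit(x, y, 1)
pred = intercept + slope * x

print(pred.min())          # the fitted line dips below zero for large x
print((pred < 0).any())    # True: the linear model ignores the 0 bound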

Therefore, a researcher faces a typical problem here: if the intuition behind the problem adds logical or theoretical assumptions that cannot be accounted for by the model chosen to solve it, then the results provided by the model may be biased.

Thus, for discrete or bounded outcome variables we need to find a better way of solving this problem. To do so, we need to put constraints on the admissible values and then use maximum entropy to choose the distribution. In that case, all we have to do is generalize the linear regression strategy to probability distributions other than the Gaussian. In other words, we need to extend the spectrum of probability distributions that can be used. This is the essence of the GLM. Such models look as follows:

yi ~ Binomial(n, pi)

f(pi) = α + βxi

Generally, there are only two changes here relative to the Gaussian model considered in Section 3.1. Firstly, for a bounded outcome variable the likelihood should be binomial instead of Gaussian: for a count outcome y in which each observation arises from n trials with constant expected value np, the binomial distribution has maximum entropy; the same holds when there are only two possible outcomes. The binomial distribution is therefore the least informative distribution that satisfies our prior knowledge about the outcomes y. If the outcome variable had different constraints, the maximum entropy distribution could be different as well. Secondly, GLMs need a link function f(pi), which has to be chosen separately from the choice of distribution.
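As a minimal illustrative sketch, the snippet below fits such a binomial GLM with a logit link to simulated data using the statsmodels library; the number of trials, the simulated predictor and the coefficient values are assumptions made purely for illustration.

import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)

# Simulated data: each of 200 observations is a count of successes in n = 10 trials.
n = 10
x = rng.normal(size=200)
p_true = 1 / (1 + np.exp(-(0.5 + 1.2 * x)))   # assumed "true" probabilities
y = rng.binomial(n, p_true)

# Binomial GLM: the outcome is passed as (successes, failures) pairs,
# and the default link for the Binomial family is the logit.
endog = np.column_stack([y, n - y])
exog = sm.add_constant(x)
fit = sm.GLM(endog, exog, family=sm.families.Binomial()).fit()
print(fit.params)   # intercept and slope estimates on the logit scale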

 

1.1 Exponential family of distributions and the principle of maximum entropy

The Exponential distribution is a fundamental distribution of distance and duration, the kinds of measurement that represent displacement from some point of reference, in either time or space. Logically, this distribution is constrained to be zero or positive. If the probability of an event is constant in time or across space, then the distribution of events tends towards the exponential. The exponential distribution has maximum entropy among all non-negative continuous distributions with the same average displacement.

The Gamma distribution is also constrained to be zero or positive, and it is likewise a fundamental distribution of distance and duration. The main difference from the Exponential distribution is that the Gamma distribution can have a peak above zero. If an event can only happen after two or more other events have happened before it, then the waiting time for that last event will be gamma distributed. The Gamma distribution has maximum entropy among all distributions with the same mean and the same average logarithm.
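For reference, the two densities can be written as follows, in one common parameterization with rate λ for the exponential and shape α and rate β for the gamma:

p(y \mid \lambda) = \lambda \, e^{-\lambda y}, \quad y \ge 0

p(y \mid \alpha, \beta) = \frac{\beta^{\alpha}}{\Gamma(\alpha)} \, y^{\alpha - 1} e^{-\beta y}, \quad y > 0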

The Poisson distribution is a count distribution. Mathematically, the Poisson distribution is a special case of the binomial distribution. If the number of trials n is very large (and usually unknown) and the probability of success p is very small, then the binomial distribution converges to a Poisson distribution with an expected rate of events per unit time of λ = np. Practically, the Poisson distribution is used for counts that never get close to any theoretical maximum. As a special case of the binomial, it has maximum entropy under exactly the same constraints.
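This convergence is easy to check numerically. The sketch below compares the two probability mass functions with scipy, for an assumed n of 10,000 and p of 0.0005:

import numpy as np
from scipy import stats

# A binomial with very large n and very small p is well approximated
# by a Poisson with rate lam = n * p.
n, p = 10_000, 0.0005
lam = n * p                     # expected count, here 5
k = np.arange(0, 20)

binom_pmf = stats.binom.pmf(k, n, p)
pois_pmf = stats.poisson.pmf(k, lam)

# Largest pointwise difference between the two distributions is tiny.
print(np.max(np.abs(binom_pmf - pois_pmf)))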

Generally, a researcher can think about all these distributions and the restrictions they account for, and then choose one or another of them, but it is almost impossible to think about all of them, or at least it would take too much time. Luckily, there is an easier and more parsimonious way to choose the proper likelihood function for a GLM: the principle of maximum entropy. It provides an empirically successful way to search through all of these distributions, because it accounts for all the specific restrictions mentioned above and then automatically selects the appropriate distribution. Information entropy is essentially a measure of the number of ways a distribution can arise, given the stated assumptions.
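A minimal numerical sketch of this idea, using the blue-ball example from above: among several hypothetical distributions over the four outcomes of two draws that share the same expected number of blue balls, the flat "binomial" one has the largest entropy. The candidate distributions below are illustrative assumptions, not data.

import numpy as np

def entropy(p):
    """Information entropy H(p) = -sum(p_i * log(p_i)), with 0*log(0) taken as 0."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

# Four outcomes of two draws: (white,white), (blue,white), (white,blue), (blue,blue).
# Every candidate below has the same expected number of blue balls (exactly 1),
# but only the first (independent draws with p = 0.5) is the maximum entropy choice.
candidates = {
    "binomial": [1/4, 1/4, 1/4, 1/4],
    "candidate B": [2/6, 1/6, 1/6, 2/6],
    "candidate C": [1/6, 2/6, 2/6, 1/6],
    "candidate D": [1/8, 4/8, 2/8, 1/8],
}

for name, probs in candidates.items():
    expected_blue = probs[1] + probs[2] + 2 * probs[3]
    print(f"{name}: E[blue] = {expected_blue:.2f}, entropy = {entropy(probs):.3f}")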

One more advantage of the principle of maximum entropy is that we can also use it to choose priors. This approach is very effective in the case of GLMs, which work well with conventional weakly informative priors, because maximum entropy provides a way to generate priors that account for all the background information while assuming as little as possible. This helps the researcher make honest choices and avoid bias.

 

1.2 Link functions and their role in the GLM

In this section, we pay closer attention to the problem of outcome restrictions and to why we use GLMs when we need to account for such strict assumptions.

Usually, in real-life situations we do not observe a mean parameter μ describing the average outcome, or, even if we do, it is rarely unbounded in both directions. For example, the shape of the binomial distribution is determined, like the Gaussian, by two parameters, but unlike the Gaussian, neither of these parameters is the mean. Instead, the mean outcome is np, which is a function of both parameters. Since n is usually known, the standard approach is to attach a linear model to the unknown part, p, which is a probability mass. Thus pi must lie between 0 and 1. The problem here is that the linear model itself can freely fall below zero or exceed one and thus predict a linear trend that goes beyond our restrictions, so that the shape of the model will be wrong.

Thus, it is a common problem for a researcher to encounter things like negative distances or probability masses that exceed 1. To prevent such mathematical accidents, we use a link function. The main goal of such a function is to map the linear space of a model like α + βxi onto the non-linear space of a parameter. For most GLMs we use one of the two most common link functions: a logit link for mapping probabilities (to constrain them to lie between 0 and 1) and a log link for mapping positive values (to avoid negative distances).

The logit link maps a parameter that is defined as a probability mass, p, onto a linear model that can take on any real value. This link is extremely common when working with binomial GLMs. When we use a logit link for a parameter, we simultaneously define the parameter's value to be the logistic transform of the linear model. In other words, we transform the linear space of unit changes (which is continuous in both the positive and negative directions) into the space constrained entirely between 0 and 1. This transformation does affect the interpretation of parameter estimates, because a unit change in a predictor variable no longer produces a constant change in the mean of the outcome variable. Instead, a unit change in xi may produce a larger or smaller change in the probability pi, depending on how far from zero the log-odds are. The method changes the interpretation, but in general it solves the problem with bias: the probability in the linear model is now restricted, so that p lies between 0 and 1.
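A minimal sketch of the logistic transform and of the non-constant effect of a unit change in the predictor; the intercept and slope values here are arbitrary assumptions:

import numpy as np

def logit(p):
    """Log-odds: maps a probability in (0, 1) to the whole real line."""
    return np.log(p / (1 - p))

def inv_logit(z):
    """Logistic transform: maps any real value back into (0, 1)."""
    return 1 / (1 + np.exp(-z))

# Hypothetical intercept and slope for p_i = inv_logit(alpha + beta * x_i).
alpha, beta = 0.0, 1.0
x = np.array([-4, -1, 0, 1, 4], dtype=float)
p = inv_logit(alpha + beta * x)

# A unit change in x changes p by different amounts, depending on
# how far the log-odds alpha + beta * x are from zero.
print(np.round(p, 3))
print(np.round(inv_logit(alpha + beta * (x + 1)) - p, 3))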

The second very common link function is the log link. This function maps a parameter that is defined over only positive real values onto a linear model. For example, suppose we want to model a parameter that is restricted to be a positive real number. A log link is very useful in this situation because it defines the parameter's value to be the exponentiation of the linear model.

A log link in a linear model implies exponential scaling of the outcome with the predictor variable. Since the exponential of any real number is positive, the positivity constraint on the parameter is satisfied. However, the log link can simultaneously introduce another problem, which appears when the model is asked to predict values outside the range of the data used to fit it. In other words, we might run into trouble when forecasting with the model we have built.
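A minimal sketch, with an assumed intercept and slope, of how a log link keeps predictions positive but makes them grow exponentially once we extrapolate far beyond the fitted range:

import numpy as np

# Log link: log(mu_i) = alpha + beta * x_i, so mu_i = exp(alpha + beta * x_i).
# alpha and beta are hypothetical values chosen for illustration.
alpha, beta = 0.0, 0.7
x_fit = np.linspace(0, 5, 6)       # range the model was fit on
x_far = np.array([10.0, 20.0])     # far outside that range

mu_fit = np.exp(alpha + beta * x_fit)
mu_far = np.exp(alpha + beta * x_far)

print(np.round(mu_fit, 2))  # modest, always positive values
print(mu_far)               # explodes: roughly 1.1e3 and 1.2e6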

To sum up this section, using the most common logit and log link functions in linear models does solve the problem of constraining a probability to the range from 0 to 1 and a parameter to be positive. However, link functions can sometimes distort inference. If a researcher wants to avoid such distortion, it is possible to perform a sensitivity analysis, which explores how changes in assumptions influence inference. If none of the alternative assumptions have much impact on inference, then we can keep the link function that has been chosen. The necessity of choosing a link function to bind the linear model to the generalized outcome solves the problem of adding constraints on restricted values or probabilities, but it simultaneously introduces new complexities in model specification, estimation and interpretation.

 

2. Conclusion

Entropy maximization in probability theory is largely about counting, and such counting allows us to solve the problem of choosing among distributions. Of course, there is no guarantee that it gives us the best probability distribution for the real empirical problem that we analyze. But it is a guarantee that no other distribution more conservatively reflects our assumptions, because any other distribution implies something beyond our assumptions, something unknown. A full and truthful accounting of assumptions is crucial, because it gives us an understanding of how a model misbehaves. And since all models misbehave sometimes, it is good to be on the safe side and anticipate those failures before they happen.

To sum up, the principle of maximum entropy provides an empirically successful way to choose likelihood functions, and we can apply this method to models of any complexity. In the context of Generalized Linear Models, maximum entropy helps us account for the constraints on our variables in the simplest and most direct way. With our bet on the maximum entropy distribution we might not get a perfect result in practice, but at least we will not have included unreasonable constraints that could bias the final result. It is always important to remember that the principle of maximum entropy gives different solutions for different real-life problems. Depending on the characteristics of the problem itself, the maximum entropy solution could be the Binomial distribution, the Gaussian distribution or any other. So, the main principle here is to think first about the data itself.