1. Generalized Linear Model

One of the most important tasks for any researcher doing regression analysis is to choose the best possible model for the data set. How well the model works in practice depends on our constraints and on how closely our assumptions coincide with the assumptions of the model itself. The Gaussian models discussed in Section 3.1 come with requirements: the outcome variable must be continuous and normally distributed.
For an outcome variable that is continuous and far away from any theoretical maximum or minimum, such Gaussian models have maximum entropy. In social science research, however, we very rarely have a continuous outcome without any boundaries. In most cases it is typical to have dichotomous, nominal or ordinal outcomes. If, instead of a continuous outcome, we have a variable that is either discrete or bounded, a Gaussian likelihood is no longer the most powerful choice, nor the choice with the highest possible entropy. For example, if the outcome is a count, such as the number of blue balls pulled from a bag (instead of the probability of drawing a blue ball from the bag, as in the previous example), then the outcome is constrained to be 0 or a positive integer, because it counts real objects. A Gaussian model is inappropriate for such a variable for several reasons. Firstly, although we assume that counts cannot be negative, a linear regression model does not account for that. Such a model will predict negative values whenever the mean count is close to 0.
Secondly, we will probably run into heteroscedasticity and non-normality of the errors, which arise when the outcome is not continuous. Most statistical tests on the parameters of the linear regression model are then unjustified: we cannot trust their results, because some of the basic assumptions of the linear model are violated. Thirdly, the functional form specified by the linear model will be incorrect, since a straight line is unbounded while the outcome is not. If we want a trustworthy model, we need a function that truthfully accounts for all specific limitations on our variables, especially on the outcome variable.
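The first of the three problems above can be illustrated with a minimal sketch. The data below are hypothetical counts invented for the illustration, and `fit_line` is a helper written here only to compute an ordinary least squares line in closed form. The point is simply that a straight line fitted to counts whose mean is close to 0 can return negative predicted counts.

```python
# Hypothetical count data (an assumption for illustration only):
# the counts are never negative and stay near zero for small x.
xs = [0, 1, 2, 3, 4, 5, 6, 7]
ys = [0, 0, 0, 1, 1, 3, 5, 9]


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x, using the closed form."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


a, b = fit_line(xs, ys)
predictions = [a + b * x for x in xs]

# Predictor values at which the fitted line predicts a negative count.
negative_xs = [x for x, pred in zip(xs, predictions) if pred < 0]

print("intercept:", round(a, 3), "slope:", round(b, 3))
print("x values with negative predicted counts:", negative_xs)
```

Nothing in the least squares fit knows that the outcome is a count, so the line is free to cross zero wherever the data pull it there.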
Therefore, the researcher faces a typical problem: if the intuition behind the problem adds logical or theoretical assumptions that cannot be accounted for by the model chosen to solve it, then the results provided by the model could be biased. Thus, for discrete or bounded outcome variables we need to find a better way of solving this problem. To do so, we state the constraints on the outcome values and then use maximum entropy to choose the distribution. All we have to do is generalize the linear regression strategy to probability distributions other than the Gaussian. In other words, we need to extend the spectrum of probability distributions that can be used. This is the essence of the GLM. Such models look as follows:

$$y_i \sim \mathrm{Binomial}(n, p_i)$$
$$f(p_i) = \alpha + \beta x_i$$

There are only two changes here from the Gaussian model considered in Section 3.1. Firstly, in the case of a bounded outcome variable the likelihood is binomial instead of Gaussian: for a count outcome y in which each observation arises from n trials with only two possible results and with constant expected value np, the binomial distribution has maximum entropy. So the binomial distribution is the least informative distribution that satisfies our prior knowledge of the outcomes y. However, if the outcome variable had different constraints, the maximum entropy distribution could be different as well.
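The maximum entropy claim for the binomial can be checked numerically in a small case. The sketch below assumes two draws from a bag of blue and white balls and an expected count of one blue ball. The three alternative distributions B, C and D are arbitrary illustrative choices that satisfy the same constraint, so this is a check on a few candidates, not a proof.

```python
import math


def entropy(probs):
    """Shannon entropy in nats; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0)


# Possible sequences of two draws, in the order:
# (white, white), (blue, white), (white, blue), (blue, blue)
blue_counts = [0, 1, 1, 2]

# Candidate distributions over the four sequences. The first is the
# binomial with p = 0.5; B, C and D are hypothetical alternatives.
candidates = {
    "binomial, p = 0.5": [1 / 4, 1 / 4, 1 / 4, 1 / 4],
    "B": [2 / 6, 1 / 6, 1 / 6, 2 / 6],
    "C": [1 / 6, 2 / 6, 2 / 6, 1 / 6],
    "D": [1 / 8, 4 / 8, 2 / 8, 1 / 8],
}


def expected_count(probs):
    """Expected number of blue balls under a distribution over sequences."""
    return sum(p * c for p, c in zip(probs, blue_counts))


for name, probs in candidates.items():
    # Every candidate satisfies the same constraint: one blue ball expected.
    assert abs(expected_count(probs) - 1.0) < 1e-12
    print(name, "entropy:", round(entropy(probs), 3))

best = max(candidates, key=lambda name: entropy(candidates[name]))
print("highest entropy:", best)
```

All four candidates agree on the expected count, yet the binomial spreads probability most evenly over the sequences, which is what makes it the least informative choice among them.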
Secondly, a GLM needs a link function f(p_i), which should be determined separately from the choice of distribution.

1.1 Exponential Family of distributions and the principle of maximum entropy

The Exponential distribution is a fundamental distribution of distance and duration, the kinds of measurement that represent displacement from some point of reference, either in time or in space. Logically, this distribution is constrained to be 0 or positive. If the probability of an event is constant in time or across space, then the distribution of the waiting time or distance between events tends towards the exponential.
The exponential distribution has maximum entropy among all non-negative continuous distributions with the same average displacement.

The Gamma distribution is also constrained to be 0 or positive and is also a fundamental distribution of distance and duration. The main difference between the Gamma and the Exponential distribution is that the former can have a peak above zero. If an event can only happen after two or more exponentially distributed events have happened before it, then the resulting waiting times will be gamma distributed. The Gamma distribution has maximum entropy among all distributions with the same mean and the same average logarithm.
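The claim about the exponential distribution can be checked against a few other non-negative distributions with the same mean. The sketch below uses the standard closed-form differential entropies. The three comparison distributions (a uniform on [0, 2 * mean], a half-normal, and a Gamma with shape 2) are choices made here for illustration, so the comparison is a spot check under those assumptions and not a proof.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def entropy_exponential(mean):
    """Differential entropy (nats) of an exponential with the given mean."""
    return 1.0 + math.log(mean)


def entropy_uniform(mean):
    """Entropy of a uniform distribution on [0, 2 * mean]."""
    return math.log(2.0 * mean)


def entropy_half_normal(mean):
    """Entropy of a half-normal whose scale sigma is set to match the mean."""
    sigma = mean * math.sqrt(math.pi / 2.0)
    return 0.5 * math.log(math.pi * sigma ** 2 / 2.0) + 0.5


def entropy_gamma_shape2(mean):
    """Entropy of a Gamma distribution with shape 2 and the given mean."""
    shape = 2.0
    scale = mean / shape
    digamma_2 = 1.0 - EULER_GAMMA  # digamma function evaluated at 2
    return (shape + math.log(scale) + math.lgamma(shape)
            + (1.0 - shape) * digamma_2)


mean = 1.0
results = {
    "exponential": entropy_exponential(mean),
    "uniform": entropy_uniform(mean),
    "half-normal": entropy_half_normal(mean),
    "gamma (shape 2)": entropy_gamma_shape2(mean),
}
for name, value in results.items():
    print(name, round(value, 4))
```

The Gamma with shape 2 is included because it has a peak above zero: that extra structure is additional information, and it is paid for with lower entropy at the same mean.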
The Poisson distribution is a count distribution. Mathematically, it is a special (limiting) case of the binomial distribution. If the number of trials n is very large (and usually unknown) and the probability of a success p is very small, then the binomial distribution converges to a Poisson distribution with an expected rate of events per unit time of λ = np. In practice, the Poisson distribution is used for counts that never get close to any theoretical maximum.
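The convergence of the binomial to the Poisson can be seen numerically. The sketch below holds the rate λ = np fixed at 2 (an arbitrary value chosen for illustration) and lets n grow, comparing the two probability mass functions on the counts 0 to 10.

```python
import math


def binomial_pmf(k, n, p):
    """Probability of k successes in n independent trials."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)


def poisson_pmf(k, lam):
    """Probability of k events for a Poisson distribution with rate lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


lam = 2.0  # fixed expected count, lam = n * p (illustrative value)


def max_gap(n):
    """Largest absolute difference between the two pmfs for k = 0..10."""
    p = lam / n
    return max(abs(binomial_pmf(k, n, p) - poisson_pmf(k, lam))
               for k in range(0, 11))


gaps = {n: max_gap(n) for n in (10, 100, 1000, 10000)}
for n, gap in gaps.items():
    print("n =", n, "largest pmf difference:", gap)
```

As n grows and p shrinks with np held constant, the largest difference between the two distributions shrinks towards zero.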
As a special case of the binomial, the Poisson distribution has maximum entropy under exactly the same constraints.

In principle, a researcher can think through all these distributions and the restrictions they account for and then choose one or another, but it is almost impossible to consider all of them. Or, at least, it would take too much time. Luckily, there is an easier and more parsimonious way to choose the proper likelihood function for a GLM: the principle of maximum entropy. It provides an empirically successful way to choose among all of these distributions, because once the specific restrictions mentioned above are stated, it singles out the appropriate distribution. Information entropy is essentially a measure of the number of ways a distribution can arise, according to the stated assumptions. One more advantage of the principle of maximum entropy is that we can use it to choose priors. This is very effective in the case of GLMs, because GLMs are easy to use with conventional weakly informative priors, and maximum entropy provides a way to generate a prior that accounts for all the background information while assuming as little as possible.
This helps the researcher to make truthful choices and to avoid bias.

1.2 Link functions and their role in the GLM

In this section, we pay more attention to the problem of outcome restrictions and to why we use GLMs to model situations where we need to account for strict assumptions.

Usually, in real-life situations we rarely have a mean parameter μ describing the average outcome, and even when we do, it is rarely unbounded in both directions. For example, the shape of the binomial distribution is determined, like that of the Gaussian, by two parameters. But unlike the Gaussian, neither of these parameters is the mean. Instead, the mean outcome is np, which is a function of both parameters.
Since n is usually known, the standard approach is to attach a linear model to the unknown part, p, which is a probability mass. Thus, p_i must lie between 0 and 1. The problem is that the linear model itself can freely fall below zero or exceed one, and so predict a linear trend that goes beyond our restrictions, so that the shape of the model will be wrong. It is therefore common for a researcher to run into problems such as negative distances or probability masses that exceed 1. To prevent such mathematical accidents, we use a link function. The main goal of such a function is to map the linear space of a model like α + βx_i onto the non-linear space of a parameter.
For most GLMs we use one of the two most common link functions: a logit link for mapping probabilities (to constrain them to lie between 0 and 1) and a log link for mapping positive values (to avoid, for example, negative distances).

The logit link maps a parameter that is defined as a probability mass, p, onto a linear model that can take on any real value. This link is extremely common when working with binomial GLMs. When we use a logit link for a parameter, we are defining the parameter's value to be the logistic transform of the linear model. In other words, we transform the linear space (which is continuous in both the positive and the negative direction) into a space constrained entirely between 0 and 1. This transformation does affect the interpretation of parameter estimates, because a unit change in a predictor variable no longer produces a constant change in the mean of the outcome variable. Instead, a unit change in x_i may produce a larger or smaller change in the probability p_i, depending on how far from zero the log-odds are.
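The effect of the logit link on interpretation can be shown with a minimal sketch. The coefficient values `alpha = 0` and `beta = 1` are hypothetical and chosen only to keep the arithmetic easy to follow. The code compares the change in probability produced by the same unit change in x at two different places on the log-odds scale.

```python
import math


def logit(p):
    """Log-odds of a probability p, defined for 0 < p < 1."""
    return math.log(p / (1 - p))


def inv_logit(z):
    """Logistic transform: maps any real number into the interval (0, 1)."""
    return 1 / (1 + math.exp(-z))


# Hypothetical coefficients of the linear model alpha + beta * x.
alpha, beta = 0.0, 1.0


def prob(x):
    """Probability implied by the linear model under a logit link."""
    return inv_logit(alpha + beta * x)


# The same unit change in x, at log-odds near zero and far from zero.
change_near_zero = prob(1) - prob(0)
change_far_from_zero = prob(5) - prob(4)

print("change in p from x = 0 to x = 1:", round(change_near_zero, 4))
print("change in p from x = 4 to x = 5:", round(change_far_from_zero, 4))
```

The linear model moves by the same amount in both cases, but the probability moves much less once the log-odds are far from zero, because it is being compressed against the upper bound of 1.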
Using the logit link changes the interpretation, but it solves the problem of bias described above: the probability implied by the linear model is now restricted, that is, p lies between 0 and 1.

The second very common link function is the log link. This function maps a parameter that is defined only over positive real values onto a linear model. For example, suppose we want to model a parameter that is restricted to be a positive real number. A log link is very useful in this situation because it assumes that the parameter's value is the exponentiation of the linear model. A log link in a linear model implies an exponential scaling of the outcome with the predictor variable.
The value of the exponential function is always positive, whatever its argument, so the problem of negative predictions is solved. However, the log link can create another problem when the model is asked to predict values outside the range of the data used to fit it, because exponential growth cannot continue indefinitely. In other words, we might have problems with forecasting based on the model that we have built.
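Both properties of the log link, guaranteed positivity and exponential scaling, can be seen in a short sketch. The coefficients `alpha = 0.5` and `beta = 0.3` and the fitted range of x from 0 to 5 are assumptions made for illustration and do not come from any fitted model.

```python
import math

# Hypothetical coefficients of the linear model alpha + beta * x.
alpha, beta = 0.5, 0.3


def mean_outcome(x):
    """Parameter value implied by a log link: exp of the linear model."""
    return math.exp(alpha + beta * x)


# Each unit step in x multiplies the parameter by the same factor exp(beta),
# instead of adding a constant amount as in an ordinary linear model.
ratios = [mean_outcome(x + 1) / mean_outcome(x) for x in range(-3, 3)]

# Suppose (hypothetically) the model was fitted on x between 0 and 5.
inside_range = mean_outcome(5)
far_outside_range = mean_outcome(30)

print("multiplicative effect per unit of x:", round(ratios[0], 4))
print("prediction at x = 5:", round(inside_range, 2))
print("prediction at x = 30:", round(far_outside_range, 2))
```

The prediction stays positive for any x, but far outside the assumed fitted range it grows by orders of magnitude, which is the forecasting risk described above.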
To sum up this section, using the common logit and log link functions for linear models does solve the problem of constraining a probability to the range from 0 to 1 and a parameter to be positive. However, link functions might sometimes distort inference. If the researcher wants to guard against such distortion, a sensitivity analysis can be carried out. Such an analysis explores how changes in assumptions influence inference. If none of the alternative assumptions has much impact on inference, then we can stay with the link function that has been chosen. The necessity of choosing a link function to bind the linear model to the generalized outcome solves the problem of imposing constraints on restricted values or probabilities, but it simultaneously introduces new complexities in model specification, estimation, and interpretation.

2. Conclusion

Entropy maximization in probability theory is largely about counting, and such counting allows us to solve the problem of choosing among distributions. Of course, there is no guarantee that it gives us the best probability distribution for the real empirical problem that we analyze. But there is a guarantee that no other distribution reflects our assumptions more conservatively, because any other distribution implies something more than just our assumptions, something unknown. A full and truthful accounting of assumptions is crucial, because it helps us understand how a model misbehaves. And since all models misbehave sometimes, it is good to be on the safe side and anticipate those misbehaviors before they happen.
To sum up, the principle of maximum entropy provides an empirically successful way to choose likelihood functions, and we can apply this method to models of any complexity. In terms of Generalized Linear Models, maximum entropy helps us to account for the constraints on the variables in the simplest and most direct way.
By betting on the maximum entropy distribution we might not get a perfect result in practice, but at least we will not include unreasonable constraints that could bias the final result. It is always important to remember that the principle of maximum entropy gives us different solutions for different real-life problems. Depending on the characteristics of the problem itself, the maximum entropy solution could be the Binomial distribution, the Gaussian distribution or any other. So the main principle here is to think first about the data itself.