Generalized Linear Models (GLM)

Tags: CS 229, Regressions

A new mindset

A probabilistic interpretation of regression is really interesting, because it unifies many things. For example, in our linear regression, we had $y \mid x \sim \mathcal{N}$, and in our classification setting, we had $y \mid x \sim \mathrm{Ber}$. Can we expand this to generalized linear models? And what do these even mean??

Distributions, expectations, hypothesis

A hypothesis $h_\theta$ returns a single number given $x$. We know that $y \mid x$ is a distribution. How does this work? Well, you can think about $h_\theta$ as returning $E[y \mid x]$. This makes sense in the case of linear regression and logistic regression, and it's a good thing to think about moving forward.

The distribution $y \mid x$ is a conditional distribution, whose parameters depend on $x$ and $\theta$. For example, if $y \mid x \sim \mathrm{Ber}(\phi)$, then this $\phi$ is a function of $x$.

Exponential family

We will be hearing more about this later!

We are about to introduce something a little confusing, but it is needed to work towards generalized linear models. We define the exponential family as

$$p(y; \eta) = b(y) \exp(\eta^T T(y) - a(\eta))$$

The $\eta$ is the natural parameter (also known as the canonical parameter), $T(y)$ is the sufficient statistic, which usually is $T(y) = y$. $a(\eta)$ is the log partition function.

The function $g$ that gives the distribution's mean as a function of the natural parameter, $g(\eta) = E[T(y); \eta]$, is called the canonical response function. Its inverse is called the canonical link function.

When you fix $T, a, b$, you get a family of distributions that is parameterized by $\eta$.

Exponential families include the Bernoulli, Gaussian, multinomial, Poisson, gamma, exponential, beta, Dirichlet, etc.

Key properties

  1. $E[y] = \frac{\partial}{\partial \eta} a(\eta)$. This is really neat because it saves us from integrating. (A quick numerical check of properties 1 and 2 is sketched right after this list.)
    • Proof: we use a pretty nifty integration trick

      Start from something we know: $\int p(y) dy = 1$. Therefore, $\frac{\partial}{\partial \eta} \int p(y) dy = 0$. However, we can expand this out: moving the derivative inside the integral (with $T(y) = y$) gives $\int b(y) e^{\eta y - a(\eta)} (y - a'(\eta)) dy = 0$, which rearranges to $E[y] = a'(\eta)$.

  2. $Var(y) = \frac{\partial^2}{\partial \eta^2} a(\eta)$
    • Proof: we use what we derived previously. Differentiating the identity $\int p(y)(y - a'(\eta)) dy = 0$ with respect to $\eta$ once more gives $E[(y - a'(\eta))^2] - a''(\eta) = 0$, i.e. $Var(y) = a''(\eta)$.
  3. When we do an MLE WRT $\eta$, the negative log likelihood is convex
    • proof:

      We will derive the Hessian of $-\log p(y)$ and show that it's PSD. From the exponential family form, $-\log p(y) = -\log b(y) - \eta^T T(y) + a(\eta)$, so the Hessian with respect to $\eta$ is $\nabla^2_\eta a(\eta)$.

      Now, we just need to show that this is PSD: by the previous property, $\nabla^2_\eta a(\eta)$ is the covariance matrix of $T(y)$, and covariance matrices are always PSD.
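
Since properties 1 and 2 are easy to check numerically, here is a minimal sanity check in Python. It assumes the Bernoulli's log partition function $a(\eta) = \log(1 + e^\eta)$ (derived later in these notes) and uses finite differences for the derivatives; the value of $\eta$ is arbitrary:

```python
import math

# Log partition function for the Bernoulli (derived later in these notes)
def a(eta):
    return math.log(1.0 + math.exp(eta))

eta = 0.7
phi = 1.0 / (1.0 + math.exp(-eta))  # mean of the Bernoulli: phi = sigmoid(eta)

# Finite-difference approximations of a'(eta) and a''(eta)
h = 1e-5
a1 = (a(eta + h) - a(eta - h)) / (2 * h)
a2 = (a(eta + h) - 2 * a(eta) + a(eta - h)) / h ** 2

print(a1, phi)               # a'(eta)  ~= E[y]   = phi
print(a2, phi * (1 - phi))   # a''(eta) ~= Var(y) = phi(1 - phi)
```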

GLM from the exponential family

To construct a generalized linear model from the exponential family, we make three design choices.

First, that $y \mid x$ is in the exponential family. Second, that the prediction function $h$ is just the expected value of the distribution: $h_\theta(x) = E[y \mid x]$. Third, that the natural parameter $\eta$ is a linear function of $x$: $\eta = \theta^T x$ (hence the term LINEAR model).

So the exponential family produces a distribution $p(y; \eta)$. We use this to model what we want, which is $p(y \mid x; \theta)$. We can imagine the $\eta$ as being a sort of "latent" variable from which the 1d distribution arises.

To get the final model, we derive a canonical response function that maps $\eta$ to the canonical parameters of the distribution, like $\phi$ or $\mu$, which gives us $h_\theta$ implicitly.

To derive this response function, we just look at $y \mid x$ and try to squeeze it into the exponential family form.

To summarize:

You can actually think of $h_\theta(x) = g(\theta^T x)$ as an expression of a GLM. The $\theta^T x$ is just the $\eta$, and the $g$ is the canonical response function!

Optimizing your GLM

The only learnable parameter is $\theta$. Again, think of $\eta$ as a latent variable that $\theta$ maps $x$ to.

For any model in the exponential family, we optimize by doing

$$\theta = \arg \max_\theta p(y \mid x; \theta)$$

Or in other words, fit the $y$ to the $x$ with as high a probability as possible. To derive the update rule, we start with the distribution, like the Gaussian for linear regression or the Bernoulli for logistic regression. Then, we substitute $h_\theta(x)$ in for parameters like $\mu$ or $\phi$. This substitution turns the original distribution, which depends only on $y$ and some parameters, into a distribution over $y$ conditioned on $x$ and parameterized by $\theta$. This allows you to take the derivative and optimize.

Deriving the general update rule

Our distribution is

$$p(y) = b(y)\exp(\eta y - a(\eta))$$

By our design, we have

$$p(y \mid x) = b(y)\exp(\theta^T x \cdot y - a(\theta^T x))$$

Our objective is to perform MLE, so we can just do log-probability:

$$\log p(y \mid x) = \log b(y) + y x^T \theta - a(x^T \theta)$$

Taking the gradient gets us

$$\nabla_\theta \log p(y \mid x) = yx - a'(x^T\theta)x = (y - a'(x^T\theta))x$$

Therefore, the update is

$$\theta := \theta + \alpha(y - h_\theta(x))x$$

Interestingly, we see that $a'(\theta^T x) = h_\theta(x) = E[y \mid x]$, which is what we showed before as well.
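
As a concrete illustration, here is a minimal sketch of this update rule in Python. The `glm_sgd_step` helper and the toy data point are hypothetical stand-ins; plugging in the sigmoid as `a_prime` recovers the logistic regression update:

```python
import numpy as np

def glm_sgd_step(theta, x, y, a_prime, lr=0.1):
    """One stochastic gradient ascent step on log p(y | x; theta).

    a_prime is the derivative of the log partition function, which is
    also the canonical response function, so a_prime(theta @ x) = h_theta(x).
    """
    h = a_prime(theta @ x)           # h_theta(x) = E[y | x]
    return theta + lr * (y - h) * x  # the same (y - h) * x form for every GLM

# Bernoulli case (logistic regression): a'(eta) = sigmoid(eta)
sigmoid = lambda eta: 1.0 / (1.0 + np.exp(-eta))
theta = np.zeros(3)
x, y = np.array([1.0, 2.0, -1.0]), 1.0  # hypothetical data point
theta = glm_sgd_step(theta, x, y, sigmoid)
print(theta)
```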

Derivations of distributions

Strategies

  1. Rewrite things as $\exp(\log(\cdot))$. This often loosens things up with inner exponents, etc.
  2. Group all terms with $y$ together. The coefficient becomes your $\eta$.
  3. Find the canonical response function by solving for the parameter that forms $\eta$.
  4. Make modifications to the response function to solve for the expectation.
    1. For example, in a Bernoulli distribution you get $\phi = e^\eta / (1 + e^\eta)$, which is already $E[y \mid x]$; for a binomial over $N$ trials you would scale it by $N$ to get $E[y \mid x] = N\phi$.
  5. Replace $\eta = \theta^T x$.

Let's look at some examples!

Bernoulli as Exponential Family

Canonical parameter: $\phi$

The Bernoulli is in the exponential family. We can expand the pmf like this:

$$p(y; \phi) = \phi^y (1 - \phi)^{1 - y} = \exp\left(y \log \frac{\phi}{1 - \phi} + \log(1 - \phi)\right)$$

Canonical response function: We see that $\eta = \log \frac{\phi}{1 - \phi}$, which means that $\phi = g(\eta) = \frac{1}{1 + e^{-\eta}}$.

$$T(y) = y \\ a(\eta) = -\log(1 - \phi) = \log(1 + e^\eta) \\ b(y) = 1$$
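
As a quick check that these pieces really do reproduce the Bernoulli pmf, here is a small Python sketch (the value of $\phi$ is arbitrary):

```python
import math

phi = 0.3
eta = math.log(phi / (1 - phi))  # natural parameter
a = math.log(1 + math.exp(eta))  # log partition function, equals -log(1 - phi)

for y in (0, 1):
    direct = phi ** y * (1 - phi) ** (1 - y)  # Bernoulli pmf
    expfam = math.exp(eta * y - a)            # exponential family form, b(y) = 1
    print(y, direct, expfam)                  # the two forms agree
```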

Connection to the Sigmoid

One thing you might have noticed is that the canonical response function is just the sigmoid! This means that if you assume that things are distributed as Bernoullis, the distribution's mean parameter is best described by a sigmoid of the natural parameter! Neat!!

Gaussian as Exponential family

Canonical parameter: $\mu$

You can do a similar expansion from the Gaussian density (for now, we assume that $\sigma = 1$):

$$p(y; \mu) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}(y - \mu)^2\right) = \frac{1}{\sqrt{2\pi}} \exp\left(-\frac{1}{2}y^2\right) \exp\left(\mu y - \frac{1}{2}\mu^2\right)$$

Now, we see that

$$b(y) = \frac{1}{\sqrt{2\pi}}\exp\left(-\frac{1}{2}y^2\right) \\ \eta = \mu \\ T(y) = y \\ a(\eta) = \eta^2 / 2$$

Here, we see that the canonical response function is just $\mu = g(\eta) = \eta$.
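
Again, a small Python check that the pieces above reproduce the $\mathcal{N}(\mu, 1)$ density (the values are arbitrary):

```python
import math

mu, y = 1.5, 0.4
b = math.exp(-0.5 * y ** 2) / math.sqrt(2 * math.pi)              # b(y)
expfam = b * math.exp(mu * y - mu ** 2 / 2)                       # exp family form, eta = mu
direct = math.exp(-0.5 * (y - mu) ** 2) / math.sqrt(2 * math.pi)  # N(mu, 1) density
print(direct, expfam)  # equal
```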

Multinomial as an exponential family

Canonical parameters: $\phi_1, \dots, \phi_k$, where each $\phi_i$ represents the probability of one outcome. Note how you would need a vector to represent this.

Because we are dealing with essentially a vector of outcomes, we are dealing with vectors instead of scalars. We have $y \in \{1, \dots, k\}$ equal to the "category" of the outcome, and we define $T(y) \in \mathbb{R}^{k-1}$ as the indicator vector whose $i$-th entry is 1 exactly when $y = i$ (so $T(k)$ is the all-zeros vector).

You can also write this as

$$T(y)_i = 1\{y = i\}$$

which will be more important below.

Now, we are ready to derive the exponential family form. We start with the multinomial pmf and make our substitutions. It's important to express $\phi_k$ in terms of the other $\phi_i$ (via $\phi_k = 1 - \sum_{i=1}^{k-1} \phi_i$), as the $\phi_i$ are constrained to sum to 1:

$$p(y; \phi) = \prod_{i=1}^{k} \phi_i^{1\{y = i\}} = \exp\left(\sum_{i=1}^{k-1} T(y)_i \log \frac{\phi_i}{\phi_k} + \log \phi_k\right)$$

Where

$$\eta_i = \log \frac{\phi_i}{\phi_k} \qquad a(\eta) = -\log \phi_k \qquad b(y) = 1$$

Now, let's look at the canonical response function. If you look at $\eta$, we see that

$$\phi_i = e^{\eta_i} \phi_k$$

To get $\phi_k$, let's use the identity that $\sum_i \phi_i = 1$ (where we define $\eta_k = 0$, so that $\phi_k = e^{\eta_k}\phi_k$ too):

$$\phi_k \sum_{i=1}^{k} e^{\eta_i} = \sum_{i=1}^{k} \phi_i = 1$$

And this means that $\phi_k = 1 / \sum_j e^{\eta_j}$, which means that

$$\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}}$$

To recap, what is shown above is the canonical response function for the multinomial distribution, which maps a vector of logit values to a probability distribution. This is known as the softmax function.
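
In code, the softmax is usually implemented with a max subtraction for numerical stability (which doesn't change the output, since it cancels in the ratio). A minimal sketch:

```python
import numpy as np

def softmax(eta):
    """Canonical response function for the multinomial: logits -> probabilities."""
    z = eta - np.max(eta)  # stability shift; cancels out in the ratio
    e = np.exp(z)
    return e / e.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))  # sums to 1; largest logit wins
```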

Constructing GLMs

Ordinary least squares

Recall that $y \mid x \sim \mathcal{N}(\mu, \sigma^2)$. As such, we have $h_\theta(x) = E[y \mid x; \theta]$.

However, we know that the expectation of a normal is just $\mu$, and we have previously derived that $\mu = \eta$. Therefore, $h_\theta(x) = \eta = \theta^T x$. Neat!! This is what we expected!!

Logistic regression

Here, we actually derive why we use the sigmoid!!

Recall that $y \mid x \sim \mathrm{Ber}(\phi)$, where $\phi$ is defined implicitly through $x$. Now, the following is true:

$$h_\theta(x) = E[y \mid x; \theta] = \phi = \frac{1}{1 + e^{-\eta}} = \frac{1}{1 + e^{-\theta^T x}}$$

And we arrive upon our logistic regression equation! And this kinda derives the sigmoid function!!

Softmax regression

This one is interesting. Instead of a binary classifier, we have a $k$-class classifier, and we assume that the label is distributed according to a multinomial distribution whose parameters $\phi_i$ depend on the input $x$. Now, we have previously derived the canonical response function for the multinomial:

$$\phi_i = \frac{e^{\eta_i}}{\sum_{j=1}^{k} e^{\eta_j}}$$

Furthermore, we know that $\eta_i = \theta_i^T x$, so we get the following:

$$p(y = i \mid x; \theta) = \phi_i = \frac{e^{\theta_i^T x}}{\sum_{j=1}^{k} e^{\theta_j^T x}}$$

Another way of thinking about this is that we find the logits (the $\theta_i^T x$) and then we apply a softmax.

Softmax regression: The graphical intuition

The matrix $\theta$ has one parameter vector per class, so $\theta^T$ is essentially a stack of rows, one row $\theta_i^T$ for each class.

So you can imagine $\theta^T x$ as running $k$ separate linear "tests" on the current point.

Each of these results is fed into a softmax, and we get a probability distribution. The intuition is that the groups are separated by these lines, and depending on how close a point is to each boundary, we get different "activations".

The raw logits can take a variety of values, but the class with the highest score (i.e. the "test" that returns the most positive result for this point being in that class) will have the highest value in the probability distribution.

The tl;dr: softmax regression means running multiple agents and having them vote for which category this unknown point is in.

Softmax regression: Learning

To learn the parameters $\theta$, we can use a simple log likelihood:

$$\ell(\theta) = \sum_{n=1}^{N} \log p(y^{(n)} \mid x^{(n)}; \theta) = \sum_{n=1}^{N} \log \frac{e^{\theta_{y^{(n)}}^T x^{(n)}}}{\sum_{j=1}^{k} e^{\theta_j^T x^{(n)}}}$$

If you wanted to derive the gradient, it is feasible: for each class parameter vector, it comes out to $\nabla_{\theta_i} \ell = \sum_n (1\{y^{(n)} = i\} - \phi_i^{(n)}) x^{(n)}$, the same "observed minus expected" shape as the general GLM update. A sketch of one such gradient step follows below.
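
Here is a minimal sketch of one such gradient ascent step in Python, with hypothetical toy data; `Theta` holds one column $\theta_i$ per class:

```python
import numpy as np

def softmax(eta):
    e = np.exp(eta - np.max(eta))
    return e / e.sum()

def softmax_sgd_step(Theta, x, y, lr=0.1):
    """One stochastic gradient ascent step on log p(y | x; Theta).

    Gradient wrt theta_i is (1{y = i} - phi_i) * x: observed minus expected.
    """
    phi = softmax(Theta.T @ x)  # current class probabilities
    onehot = np.zeros(Theta.shape[1])
    onehot[y] = 1.0
    return Theta + lr * np.outer(x, onehot - phi)

# Hypothetical toy example: 2 features, 3 classes, true class y = 2
Theta = np.zeros((2, 3))
x, y = np.array([1.0, -0.5]), 2
Theta = softmax_sgd_step(Theta, x, y)
print(Theta)
```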

We can also learn using a different, loss-based approach called cross entropy. In cross entropy, we compare two distributions and penalize based on their differences. This is done with the following equation:

$$\mathrm{CrossEntropy}(p, \hat{p}) = -\sum_y p(y) \log \hat{p}(y) = -E_{y \sim p}[\log \hat{p}(y)]$$

This is like entropy, but measured across two distributions: sampling from the truth $p$, how "surprised" are we to see $\hat{p}$ in each location? When $p$ is a one-hot "true label" distribution, minimizing cross entropy is exactly maximizing the log likelihood above.
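
A small numeric illustration, assuming a one-hot truth $p$ (the predicted distribution is made up). With a one-hot $p$, the sum collapses to $-\log \hat{p}(y_{true})$, which is exactly the negative log likelihood from above:

```python
import numpy as np

def cross_entropy(p, p_hat):
    return -np.sum(p * np.log(p_hat))

p_true = np.array([0.0, 1.0, 0.0])  # one-hot truth: the label is class 1
p_hat = np.array([0.2, 0.7, 0.1])   # hypothetical model prediction

print(cross_entropy(p_true, p_hat))  # = -log(0.7), the negative log likelihood
```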