Logistic Regression

Tags: CS 229, Regressions

Logistic regression

The concept is very simple. Define a predictor $h_\theta(x)$ that outputs a value between 0 and 1. We define it as the following:

$$h_\theta(x) = g(\theta^Tx) = \frac{1}{1 + e^{-\theta^Tx}}$$

Here $g$ is the sigmoid function, and $g' = g(1-g)$ (a very helpful identity).
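As a quick sanity check, here is a minimal Python sketch of the hypothesis (the names `sigmoid` and `predict_proba` are just illustrative), including a numerical check of the $g' = g(1-g)$ identity:

```python
import numpy as np

def sigmoid(z):
    # g(z) = 1 / (1 + e^{-z})
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(theta, x):
    # h_theta(x) = g(theta^T x)
    return sigmoid(theta @ x)

# Numerically verify the identity g'(z) = g(z)(1 - g(z)) at a point.
z, eps = 0.7, 1e-6
numeric_grad = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
identity_grad = sigmoid(z) * (1 - sigmoid(z))
print(np.isclose(numeric_grad, identity_grad))  # True
```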

A probabilistic view: deriving the loss function

If we look at things probabilistically, we can view $p(y = 1 | x)$ as the output of $h_\theta(x)$. It ranges from 0 to 1, and it sort of represents the "confidence" of the model. As such, we can define the following:

$$P(y = 1 | x; \theta) = h_\theta(x) \\ P(y = 0 | x; \theta) = 1 - h_\theta(x)$$
👉 As such, you can imagine $y | x$ as a Bernoulli random variable.

We can also express this compactly as

$$p(y | x; \theta) = h_\theta(x)^{y}(1 - h_\theta(x))^{1 - y}$$

Now, the likelihood over the training set is just

$$L(\theta) = \prod_{i=1}^{m} h_\theta(x^{(i)})^{y^{(i)}}(1 - h_\theta(x^{(i)}))^{1 - y^{(i)}}$$

And when we take the log likelihood, we get

$$\ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h_\theta(x^{(i)}) + (1 - y^{(i)}) \log(1 - h_\theta(x^{(i)})) \right]$$

This is our objective: to maximize $\ell$.
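As a minimal sketch, the objective can be computed directly from the formula above (vectorized over a design matrix `X`; no numerical-stability tricks, names illustrative):

```python
import numpy as np

def log_likelihood(theta, X, y):
    # ell(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    return np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
```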

Deriving the update

You can take the gradient of this function and you will get the following (using the derivative identity):

$$\frac{\partial \ell(\theta)}{\partial \theta_j} = \sum_{i=1}^{m} \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$

As such, the (stochastic gradient ascent) update is just

$$\theta_j := \theta_j + \alpha \left( y^{(i)} - h_\theta(x^{(i)}) \right) x_j^{(i)}$$
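Here is a hedged sketch of the batch version of this update (gradient ascent on $\ell$; the learning rate and iteration count are arbitrary choices, not from these notes):

```python
import numpy as np

def fit_logistic(X, y, lr=0.1, n_iters=1000):
    # Batch gradient ascent on the log likelihood.
    theta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        h = 1.0 / (1.0 + np.exp(-X @ theta))   # h_theta(x) for every example
        grad = X.T @ (y - h)                   # gradient of ell(theta)
        theta += lr * grad                     # ascent step: maximize ell
    return theta
```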

If you wanted to find the Hessian of the negative log likelihood, it's just

$$H = \sum_{i=1}^{m} h_\theta(x^{(i)})(1-h_\theta(x^{(i)}))\, x^{(i)} x^{(i)T}$$

which you can show is positive semidefinite, meaning the negative log likelihood is convex at every point (equivalently, $\ell$ is concave).
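A small numeric spot-check of the positive semidefiniteness claim (self-contained sketch with random data; names illustrative):

```python
import numpy as np

def hessian_nll(theta, X):
    # Hessian of the negative log likelihood: sum_i h_i (1 - h_i) x_i x_i^T
    h = 1.0 / (1.0 + np.exp(-X @ theta))
    W = np.diag(h * (1 - h))
    return X.T @ W @ X

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
theta = rng.normal(size=3)
eigvals = np.linalg.eigvalsh(hessian_nll(theta, X))
print(eigvals.min() >= -1e-10)  # True: eigenvalues are (numerically) nonnegative
```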

Brief interlude on softmax regression

This just assumes that $y | x$ is a multinomial. You can see the math in the GLM section, but you get that

$$p(y = k | x; \theta) = \frac{e^{\theta_k^T x}}{\sum_{j=1}^{K} e^{\theta_j^T x}}$$
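A minimal sketch of the softmax predictor, with one parameter vector per class (the `Theta` matrix and function name are illustrative):

```python
import numpy as np

def softmax_proba(Theta, x):
    # Theta has one row of parameters per class; returns p(y = k | x) for each k.
    logits = Theta @ x
    logits = logits - logits.max()    # subtract the max for numerical stability
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()
```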

Perceptron learning algorithm

Instead of a sigmoid, if we had a threshold function

$$g(z) = \begin{cases} 1 & z \geq 0 \\ 0 & z < 0 \end{cases}$$

and used the same update rule, we get the perceptron learning algorithm.

The update rule actually has a pretty strong interpretation here. First, let's view the boundary graphically.

The $\theta$ is orthogonal to the dividing line, because on that line $\theta^T x = 0$. Now, say that you got something wrong; you had $y = 1$ but $h = 0$ (shown in the figure). We know that $y - h > 0$, which means the update moves $\theta$ more towards the rogue point. In doing so, it also moves the dividing line (dotted line) so that the rogue point ends up on the correct side.
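A sketch of a single perceptron update under this threshold (same update rule as before; the learning rate is kept only for symmetry with the logistic update, names illustrative):

```python
import numpy as np

def perceptron_step(theta, x, y, lr=1.0):
    # Threshold prediction: h = 1 if theta^T x >= 0, else 0.
    h = 1.0 if theta @ x >= 0 else 0.0
    # Same update rule as logistic regression; it is nonzero only on mistakes.
    return theta + lr * (y - h) * x
```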

Partially observed positives

This is a pretty neat problem: what if we had partial observability in a classification problem? Say that we had this setup, where $t$ is the true label and $y$ is the observation: a positive can only be observed if it is truly positive ($y = 1 \Rightarrow t = 1$), but some true positives go unobserved.

We can show that

$$p(t = 1 | y = 1, x) = 1$$

In other words, if we observe a positive, then it must be positive.

We can also show that

$$p(t = 1 | x) = \frac{1}{\alpha} p(y = 1 | x)$$

In other words, the probability of a true positive is proportional to the probability of an observed positive. Taken together, this means that if we wanted to do logistic regression (i.e. model $p(y = 1 | x)$), the true decision boundary is just shifted a little bit.

To derive $\alpha$, you make the key assumption that $p(t = 1 | x) \in \{0, 1\}$. In other words, there is no doubt about the true values. However, you can't assume the same thing for $p(y = 1 | x)$, because it could be a true positive and yet not get measured as one. The probability that a true positive is actually observed as a positive is $\alpha$, i.e. $p(y = 1 | t = 1, x) = \alpha$, which is where the factor of $1/\alpha$ above comes from.
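Assuming $\alpha$ is already known or has been estimated somewhere (how to estimate it is not spelled out in these notes), the correction is just a rescaling of the logistic model's output, e.g.:

```python
import numpy as np

def true_positive_proba(theta, X, alpha):
    # p(t = 1 | x) = (1 / alpha) * p(y = 1 | x), clipped into [0, 1].
    p_y = 1.0 / (1.0 + np.exp(-X @ theta))   # logistic model of p(y = 1 | x)
    return np.clip(p_y / alpha, 0.0, 1.0)
```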