Linear Regression

Tags: CS 229, Regressions

The model

h_\theta(x) = \theta_0 + \theta_1 x_1 + \dots + \theta_n x_n

If we let x_0 = 1, then you can actually create a vector x = [x_0, x_1, \dots, x_n], which means that

h_\theta(x) = \theta^T x
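
As a minimal sketch of this hypothesis in code (NumPy and the example values are my own illustration, not from the notes):

```python
import numpy as np

def hypothesis(theta, x):
    """h_theta(x) = theta^T x, where x already includes the intercept x_0 = 1."""
    return theta @ x

# Hypothetical example with two features plus the intercept term x_0 = 1.
x = np.array([1.0, 2.0, 3.0])       # [x_0, x_1, x_2]
theta = np.array([0.5, -1.0, 2.0])  # [theta_0, theta_1, theta_2]
print(hypothesis(theta, x))         # 0.5 - 2.0 + 6.0 = 4.5
```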

The cost function

The cost function penalizes the vertical distance between the predicted value and the real value. We use vertical distance because we care about how well the line fits the outputs (the predicted values) rather than the inputs; if we cared about errors in both directions, we would use orthogonal distance instead.

We define the cost function as follows:

J(\theta) = \frac{1}{2} \sum_{i = 1}^n \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2
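
A direct translation into code (a sketch, assuming X is a matrix whose rows are the x^{(i)} with a leading 1, and y is the vector of targets):

```python
import numpy as np

def cost(theta, X, y):
    """J(theta) = 1/2 * sum_i (h_theta(x^(i)) - y^(i))^2."""
    residuals = X @ theta - y
    return 0.5 * np.sum(residuals ** 2)
```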

Gradient descent

Now, we will update the parameters \theta by this algorithm:

\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta)

Here, \alpha is the learning rate. But what is the partial derivative?

\frac{\partial}{\partial \theta_j} J(\theta) = \frac{\partial}{\partial \theta_j} \frac{1}{2} \sum_{i = 1}^n \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2

Now, we can use the chain rule and the linearity of differentiation to get the following:

= \sum_{i = 1}^n \left( \left(h_\theta(x^{(i)}) - y^{(i)}\right) \frac{\partial}{\partial \theta_j}\left(h_\theta(x^{(i)}) - y^{(i)}\right) \right)

Now, recall that h_\theta(x) = \theta^T x. The partial derivative of \theta^T x with respect to \theta_j is x_j. As such, we get

= \sum_{i = 1}^n \left( \left(h_\theta(x^{(i)}) - y^{(i)}\right) x_j^{(i)} \right)

This is known as the LMS (least mean squares) update rule, or the Widrow-Hoff learning rule. The summation over all training points gives batch gradient descent; you can also use stochastic or mini-batch gradient descent.
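
A sketch of the batch version of this rule (the step count and learning rate below are arbitrary illustrative defaults):

```python
import numpy as np

def batch_gradient_descent(X, y, alpha=0.01, n_steps=1000):
    """LMS / Widrow-Hoff updates, using the full batch on every step."""
    theta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        residuals = X @ theta - y          # h_theta(x^(i)) - y^(i) for every i
        gradient = X.T @ residuals         # sum_i (h_theta(x^(i)) - y^(i)) x^(i)
        theta = theta - alpha * gradient   # theta_j := theta_j - alpha * dJ/dtheta_j
    return theta
```

For stochastic gradient descent you would use one example per update, and for mini-batch a small random subset.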

Normal equations

Here, we wish to show that it is possible to solve the optimization problem in one go, in closed form, without resorting to an iterative approach.

Design matrix

Define a design matrix X to be an n \times (d + 1) matrix. This means that it has n rows for the n training points (the batch size) and d + 1 columns for the features (the one extra feature is due to the intercept term x_0 = 1).

Let \vec{y} be the vector containing all the target values.

The prediction is thus given by

X\theta
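
A sketch of building such a design matrix from raw inputs (the helper name is my own):

```python
import numpy as np

def design_matrix(X_raw):
    """Turn an (n, d) array of raw inputs into the (n, d + 1) design matrix."""
    n = X_raw.shape[0]
    return np.hstack([np.ones((n, 1)), X_raw])  # first column is x_0 = 1

# The vector of all n predictions is then just X @ theta.
```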

Solving for closed form

As such, the error is just given by

J(\theta) = \frac{1}{2} (X\theta - \vec{y})^T (X\theta - \vec{y})

We can derive the gradient through matrix calculus. This should be pretty straightforward; the only tricky parts are the identities

\nabla_\theta (b^T\theta) = b, \qquad \nabla_\theta (\theta^T A \theta) = 2A\theta \text{ for symmetric } A

and the fact that the inner product is symmetric, i.e. a^T b = b^T a. Expanding J(\theta) and applying these gives

\nabla_\theta J(\theta) = X^TX\theta - X^T\vec{y}

Setting the gradient to zero, the optimal parameter is given by

\theta^* = (X^TX)^{-1}X^T\vec{y}
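
In code, it is numerically nicer to solve the linear system (X^TX)\theta = X^T\vec{y} than to form the inverse explicitly; a sketch:

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: solves (X^T X) theta = X^T y."""
    return np.linalg.solve(X.T @ X, X.T @ y)
```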

Connection to Newton's method

Solving the normal equations is exactly one step of Newton's method. Newton's update is

\theta := \theta - [\nabla^2_\theta J]^{-1} \nabla_\theta J

For this J, the gradient is \nabla_\theta J = X^TX\theta - X^T\vec{y} and the Hessian is \nabla^2_\theta J = X^TX, so one update gives

\theta - (X^TX)^{-1}(X^TX\theta - X^T\vec{y}) = (X^TX)^{-1}X^T\vec{y}

So no matter where we start (for example at \theta = 0), Newton's method takes only one step to converge, because J is quadratic.
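
A quick numerical check of this claim, on made-up data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.hstack([np.ones((50, 1)), rng.normal(size=(50, 2))])    # design matrix with x_0 = 1
y = X @ np.array([1.0, 2.0, -3.0]) + 0.1 * rng.normal(size=50)

theta = np.zeros(3)                                  # start at theta = 0
gradient = X.T @ X @ theta - X.T @ y                 # grad J
hessian = X.T @ X                                    # Hessian of J
theta_newton = theta - np.linalg.solve(hessian, gradient)  # one Newton step

theta_closed = np.linalg.solve(X.T @ X, X.T @ y)     # normal equations
print(np.allclose(theta_newton, theta_closed))       # True
```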

Probabilistic Interpretation

Let's try to understand this one other way: through probability. Let y = \theta^Tx + \epsilon, where \epsilon is IID Gaussian noise with mean 0 and variance \sigma^2. In other words,

p(\epsilon) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{\epsilon^2}{2\sigma^2}\right)

And because \epsilon = y - \theta^Tx, we get

p(y^{(i)} \mid x^{(i)}; \theta) = \frac{1}{\sqrt{2\pi}\sigma} \exp\left(-\frac{(y^{(i)} - \theta^Tx^{(i)})^2}{2\sigma^2}\right)

Now, we can understand this as the likelihood of y^{(i)} occurring given x^{(i)}, parameterized by \theta. When we consider the whole dataset, we get

L(\theta) = \prod_{i=1}^n p(y^{(i)} \mid x^{(i)}; \theta)

We will actually maximize the log likelihood, because it allows us to replace the product with a sum:

\ell(\theta) = \log L(\theta) = n \log \frac{1}{\sqrt{2\pi}\sigma} - \frac{1}{2\sigma^2} \sum_{i=1}^n \left(y^{(i)} - \theta^Tx^{(i)}\right)^2

To maximize this log likelihood, we should make the term that is being subtracted as small as possible. In other words, we minimize

\frac{1}{2} \sum_{i=1}^n \left(y^{(i)} - \theta^Tx^{(i)}\right)^2

which is EXACTLY what we are looking for!

Generalizing this

This \epsilon doesn't need to be Gaussian. It can follow any distribution with density f(z): you simply plug the residual into the density, f(y - \theta^Tx), compute the log likelihood as usual, and that becomes your objective (as a function of \theta).
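
For example (a standard special case, not spelled out in these notes): if \epsilon has a Laplace density f(z) = \frac{1}{2b}\exp(-|z|/b), then the log likelihood is

\ell(\theta) = n \log \frac{1}{2b} - \frac{1}{b} \sum_{i=1}^n \left|y^{(i)} - \theta^Tx^{(i)}\right|

so maximizing it amounts to minimizing the sum of absolute errors rather than squared errors.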

Locally Weighted Linear Regression

The idea is pretty simple: make each training point contribute differently to the objective (and therefore to the gradient). In particular, if you want to predict at a point x, you want the model to fit the training points near x better than the ones far away. The objective becomes \sum_i w^{(i)} (y^{(i)} - \theta^Tx^{(i)})^2. We typically define the weights as

w^{(i)} = \exp\left(-\frac{(x^{(i)} - x)^2}{2\tau^2}\right)

This can be generalized to vector-valued inputs as well. The \tau is the bandwidth parameter. The smaller \tau is, the tighter the weighting and the narrower the window, which can result in overfitting. The larger \tau is, the looser the weighting and the wider the window, which can result in underfitting.

At run time, you take the x you want to predict, fit the regression ad hoc around it, and then make the prediction. Because we need to refit the model for every prediction, this is a non-parametric algorithm: you can't pre-bake a fixed parameter vector, discard the data, and let it run.
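
A sketch of the run-time procedure, where each query solves a weighted version of the normal equations (tau and the helper name are illustrative):

```python
import numpy as np

def lwr_predict(x_query, X, y, tau=0.5):
    """Predict at x_query with locally weighted linear regression.

    X is the (n, d + 1) design matrix (leading column of ones) and
    x_query includes the same leading 1.
    """
    # w^(i) = exp(-||x^(i) - x||^2 / (2 tau^2))
    weights = np.exp(-np.sum((X - x_query) ** 2, axis=1) / (2 * tau ** 2))
    W = np.diag(weights)
    # Weighted normal equations: (X^T W X) theta = X^T W y
    theta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)
    return x_query @ theta
```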

The intuition

Suppose the data has a shape that is clearly well-defined, but for which it is very difficult to find a single global model that works.

Instead, it might be easier to fit a local linear approximation around each point you want to predict.

And from here, you can get pretty good predictions!

Polynomial regression

Polynomial regression is just linear regression in a trench coat. In other words, we can define a feature map \phi(x) that takes in x and outputs [1, x, x^2, \dots, x^k]: you take one point and expand it into a higher-dimensional space. To fit \theta, you are still doing linear regression in \theta, but you end up with a polynomial regression. Neat!! (This also means that polynomial regression is provably convex in \theta.)

The gradient (the LMS update term) becomes

= \sum_{i = 1}^n \left( \left(h_\theta(\phi(x^{(i)})) - y^{(i)}\right) \phi(x^{(i)})_j \right)

This goes beyond polynomial regression; you can use any feature map \phi. Because the feature map doesn't get differentiated, the same optimizer can be used.
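
A sketch of this trick in one dimension (the degree k and the toy data are arbitrary):

```python
import numpy as np

def poly_features(x, k):
    """phi(x) = [1, x, x^2, ..., x^k], applied row-wise to a 1-D array."""
    return np.vander(x, N=k + 1, increasing=True)

# Fit theta by plain linear regression on the expanded features.
x = np.linspace(-1.0, 1.0, 20)
y = 1 + 2 * x - 3 * x ** 2
Phi = poly_features(x, k=2)
theta = np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
print(theta)   # approximately [1, 2, -3]
```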

Parametric vs non-parametric models

A parametric model is a model that allows you to extract some parameter \theta from the training set, which you can then use for predictions. You can discard the training set after computing \theta.

A non-parametric model may still have parameters, but it also requires the training set itself to make predictions. A good example is anything that uses a kernel method.