### Machine Learning Study Note - Linear models

Last updated: 2017-09-25 19:30:12 PDT.

### Introduction

In classification, we define a set of discriminant functions $g_j(x)$, $j = 1, \dots, K$, and then we choose $C_i$ if $g_i(x) = \max_j g_j(x)$.

If we first estimate the prior probabilities $\hat{P}(C_i)$ and the class likelihoods $\hat{p}(x \mid C_i)$, use Bayes' rule to calculate the posterior densities, and define the discriminant functions in terms of the posterior, for example

$$g_i(x) = \log \hat{P}(C_i \mid x)$$

then this is called *likelihood-based classification*.

Now we are going to discuss **discriminant-based classification**, where we assume a model directly for the discriminant, bypassing the estimation of likelihoods or posteriors. The discriminant-based approach makes no assumption about the density.

We define a model for the discriminant

$$g_i(x \mid \Phi_i)$$

explicitly parameterized with the set of parameters $\Phi_i$, as opposed to a likelihood-based scheme that has implicit parameters in defining the likelihood densities.

In the discriminant-based approach, we do not care about correctly estimating the density inside the class regions. All we care about is the correct estimation of the *boundaries* between the class regions. Estimating the boundaries directly is the easier problem only when the discriminant can be approximated by a simple function.

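We now define a linear discriminant function:

$$g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0}$$

or, stacking the $K$ classes, a vector of discriminant functions

$$\mathbf{g}(x) = W x + \mathbf{w}_0$$

where row $i$ of $W$ is $w_i^T$ and entry $i$ of $\mathbf{w}_0$ is $w_{i0}$.

The linear discriminant is used frequently, mainly due to its simplicity: both space and time complexity are $O(d)$. The model is also easy to interpret, because the magnitude of the weight $w_{ij}$ shows the importance of the input $x_j$.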

The linear discriminant should be tried first before trying a complicated model to make sure the additional complexity is justified.
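As a minimal sketch of the linear discriminant and the "pick the largest $g_i(x)$" rule, the snippet below evaluates $g_i(x) = w_i^T x + w_{i0}$ for every class in plain Python. The function names and the weights `W` and `w0` are hypothetical values chosen only for illustration; they are not learned from data.

```python
def linear_discriminants(W, w0, x):
    """Return g_i(x) = w_i . x + w_i0 for every class i."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)) + w_i0
            for w_i, w_i0 in zip(W, w0)]


def classify(W, w0, x):
    """Index of the class whose discriminant is largest."""
    g = linear_discriminants(W, w0, x)
    return max(range(len(g)), key=lambda i: g[i])


# Hypothetical weights for K = 3 classes and d = 2 inputs (illustration only).
W = [[1.0, 0.0],
     [0.0, 1.0],
     [-1.0, -1.0]]
w0 = [0.0, 0.0, 0.5]

g = linear_discriminants(W, w0, [2.0, 1.0])
label = classify(W, w0, [2.0, 1.0])
```

Each discriminant costs one pass over the $d$ inputs and stores $d + 1$ numbers, which is where the $O(d)$ figure comes from.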

### Generalizing the linear model

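If we start by adding a quadratic term

$$g_i(x \mid W_i, w_i, w_{i0}) = x^T W_i x + w_i^T x + w_{i0}$$

the model is more general, but it requires much larger training sets and may overfit on small samples.

An equivalent way is to preprocess the input by adding **higher-order terms**, also called *product terms*. For example, with two inputs $x_1$ and $x_2$ we can define $z = (x_1, x_2, x_1^2, x_2^2, x_1 x_2)^T$ and fit a linear discriminant in $z$. A linearly inseparable problem can become linearly separable in this new space. Our discriminant becomes

$$g_i(x) = \sum_{j=1}^{k} w_{ij} \phi_j(x)$$

Some examples of such functions $\phi_j$ are $\sin(x_1)$, $\exp(-(x_1 - m)^2 / c)$, $\log(x_2)$, and the indicator $1(x_1 > c)$.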
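To make the separability claim concrete, here is a small sketch using the XOR problem, which no straight line in $(x_1, x_2)$ can separate. Adding the single product term $x_1 x_2$ is enough for a linear discriminant to classify all four points. The weights below are hand-picked assumptions for illustration, not the result of any training procedure.

```python
def expand(x):
    """Map (x1, x2) to z = (x1, x2, x1 * x2) by adding one product term."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


def g(z, w, w0):
    """Linear discriminant evaluated in the expanded z space."""
    return sum(w_j * z_j for w_j, z_j in zip(w, z)) + w0


# XOR: the label is 1 exactly when the two inputs differ.
xor = {(0, 0): 0, (0, 1): 1, (1, 0): 1, (1, 1): 0}

# Hand-picked, hypothetical weights: g = x1 + x2 - 2 * x1 * x2 - 0.5
w, w0 = (1.0, 1.0, -2.0), -0.5

# Predict class 1 when g > 0, class 0 otherwise.
predictions = {x: int(g(expand(x), w, w0) > 0) for x in xor}
```

The discriminant is still linear in its parameters; only the input representation changed.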

The nonlinear functions of the input used in such an expansion are called *basis functions*.

### Geometry of the Linear Discriminant

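We start with a two-class classifier. In such a case, one discriminant function is sufficient:

$$g(x) = g_1(x) - g_2(x) = (w_1 - w_2)^T x + (w_{10} - w_{20}) = w^T x + w_0$$

and we choose $C_1$ if $g(x) > 0$ and $C_2$ otherwise. Here $w$ is the *weight vector* and $w_0$ is the *threshold*. The hyperplane $g(x) = 0$ divides the input space into two half-spaces.

Taking two points $x_1$ and $x_2$ on the decision surface, $g(x_1) = g(x_2) = 0$, then

$$w^T x_1 + w_0 = w^T x_2 + w_0 \quad\Rightarrow\quad w^T (x_1 - x_2) = 0$$

This means $w$ is perpendicular to any vector lying on the hyperplane, hence it is the normal vector of the hyperplane itself. This allows us to unambiguously write any $x$ as

$$x = x_p + r \frac{w}{\lVert w \rVert}$$

where $x_p$ is the projection of $x$ on the hyperplane, with $g(x_p) = 0$, and the factor $1 / \lVert w \rVert$ on $w$ makes sure that $r$ has the physical meaning of a signed distance from $x$ to the hyperplane. Substituting into $g$ gives $g(x) = g(x_p) + r \lVert w \rVert = r \lVert w \rVert$, so we can solve

$$r = \frac{g(x)}{\lVert w \rVert}$$

In the multiclass case, the usual approach is to assign $x$ to the class having the highest discriminant:

$$\text{choose } C_i \text{ if } g_i(x) = \max_{j} g_j(x)$$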
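The two-class geometry can be checked numerically. The sketch below computes the signed distance $r = g(x) / \lVert w \rVert$ and the projection $x_p = x - r\, w / \lVert w \rVert$, then confirms that $x_p$ lies on the hyperplane. The helper names and the values of `w`, `w0`, and `x` are assumptions chosen so the arithmetic is easy to follow.

```python
import math


def g(x, w, w0):
    """Two-class linear discriminant g(x) = w . x + w0."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + w0


def norm(w):
    """Euclidean length of the weight vector."""
    return math.sqrt(sum(w_j * w_j for w_j in w))


def signed_distance(x, w, w0):
    """r = g(x) / ||w||; positive on the C1 side, negative on the C2 side."""
    return g(x, w, w0) / norm(w)


def project(x, w, w0):
    """x_p = x - r * w / ||w||, the foot of the perpendicular from x."""
    r = signed_distance(x, w, w0)
    n = norm(w)
    return tuple(x_j - r * w_j / n for x_j, w_j in zip(x, w))


# Hypothetical hyperplane 3*x1 + 4*x2 - 5 = 0 and a test point.
w, w0 = (3.0, 4.0), -5.0
x = (3.0, 4.0)

r = signed_distance(x, w, w0)                 # distance from x to the hyperplane
x_p = project(x, w, w0)                       # projection of x on the hyperplane
r_origin = signed_distance((0.0, 0.0), w, w0) # equals w0 / ||w||
```

Setting $x = 0$ shows that the threshold $w_0$ fixes how far the hyperplane sits from the origin, while $w$ fixes its orientation.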

### Logistic regression and softmax

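If the class densities $p(x \mid C_i)$ are Gaussian and share a common covariance matrix $\Sigma$, the discriminant function is linear

$$g_i(x) = w_i^T x + w_{i0}$$

where the parameters can be analytically calculated as

$$w_i = \Sigma^{-1} \mu_i, \qquad w_{i0} = -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \log P(C_i)$$

In the case of two classes, we define $y \equiv P(C_1 \mid x)$, so that $P(C_2 \mid x) = 1 - y$, and we choose $C_1$ if $y > 0.5$. We now want to use the logit transformation on $y$:

$$\operatorname{logit}(y) = \log \frac{y}{1 - y} = \log P(C_1 \mid x) - \log P(C_2 \mid x)$$

It is the difference between the log posteriors of the two classes. In our case, the logit is linear:

$$\operatorname{logit}(P(C_1 \mid x)) = w^T x + w_0, \qquad w = w_1 - w_2, \quad w_0 = w_{10} - w_{20}$$

This means we can use $w^T x + w_0$ to predict $y$. The inverse of the logit function is the sigmoid function:

$$y = \operatorname{sigmoid}(w^T x + w_0) = \frac{1}{1 + \exp\left(-(w^T x + w_0)\right)}$$

Now generalize this to the multiclass case. Using the previous Gaussian assumption, for any two classes $C_i$ and $C_j$ we get:

$$\log \frac{P(C_i \mid x)}{P(C_j \mid x)} = (w_i - w_j)^T x + (w_{i0} - w_{j0})$$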
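As a sanity check on these formulas, the sketch below uses a one-dimensional simplification (an assumption made here so that $\Sigma$ reduces to a scalar variance) and compares the posterior computed directly from Bayes' rule with the sigmoid of the linear discriminant. The means, variance, priors, and test point are hypothetical values.

```python
import math


def gaussian_pdf(x, mu, var):
    """Density of a 1-D Gaussian with mean mu and variance var."""
    return math.exp(-(x - mu) ** 2 / (2 * var)) / math.sqrt(2 * math.pi * var)


def linear_params(mu, var, prior):
    """1-D form of w_i = mu_i / var and w_i0 = -mu_i^2 / (2 var) + log P(C_i)."""
    return mu / var, -mu ** 2 / (2 * var) + math.log(prior)


def sigmoid(a):
    """Inverse of the logit function."""
    return 1.0 / (1.0 + math.exp(-a))


# Hypothetical two-class problem with a shared variance.
mus, var, priors = [0.0, 2.0], 1.5, [0.3, 0.7]
x = 0.8

# Posterior P(C1 | x) from Bayes' rule.
joint = [gaussian_pdf(x, m, var) * p for m, p in zip(mus, priors)]
posterior_bayes = joint[0] / sum(joint)

# The same posterior from the linear discriminant: sigmoid(w * x + w0).
(w1, w10), (w2, w20) = (linear_params(m, var, p) for m, p in zip(mus, priors))
logit = (w1 - w2) * x + (w10 - w20)
posterior_sigmoid = sigmoid(logit)

# The log ratio of the posteriors equals the difference of the discriminants.
log_ratio = math.log(joint[0] / joint[1])
```

The quadratic term in $x$ cancels in the ratio because both classes share the same variance, which is why the logit ends up linear.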

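The log ratio of two posteriors can be written as the difference of two linear functions $g_i(x) = w_i^T x + w_{i0}$ and $g_j(x) = w_j^T x + w_{j0}$, and $\log P(C_i \mid x) = g_i(x) + c$ for some constant $c$ that does not depend on the class $i$. From there, because the posteriors sum to one,

$$P(C_i \mid x) = \frac{\exp(g_i(x))}{\sum_{j=1}^{K} \exp(g_j(x))}$$

This is called the *softmax function*.

Note: We don't need to compute and plug in the constant $c$ when we compute this function, because the softmax function is translation invariant. In fact, in most cases we don't even need to evaluate the posterior, because choosing the class with the largest $g_i(x)$ gives the same decision. When we do, however, we need to compute the exponential of some potentially big number, which can cause numerical issues (say, an overflow issue). To avoid this, we can instead shift all the values down by the maximum among $g_1(x), \dots, g_K(x)$. Then we will only deal with exponentials between $0$ and $1$.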
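The max-shift trick can be sketched in a few lines of plain Python. The function name `softmax` and the input values are illustrative assumptions; the point is that subtracting the maximum leaves the result unchanged while keeping every exponent at or below zero.

```python
import math


def softmax(g):
    """Softmax of a list of discriminant values, shifted by max(g) for stability."""
    m = max(g)
    # After the shift every exponent is <= 0, so each exp lies in (0, 1].
    exps = [math.exp(v - m) for v in g]
    total = sum(exps)
    return [e / total for e in exps]


# Large inputs: math.exp(1000.0) on its own raises OverflowError in Python.
p = softmax([1000.0, 1001.0, 1002.0])

# The same values shifted down by 1000 give the same probabilities.
q = softmax([0.0, 1.0, 2.0])
```

Because the largest shifted value is exactly zero, the denominator is at least one, so the division is always safe.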