In classification, we define a set of discriminant functions $g_i(x)$, $i = 1, \dots, K$, and then we choose class $C_i$ if $g_i(x) = \max_k g_k(x)$.
If we first estimate the prior probabilities $P(C_i)$ and the class likelihoods $p(x \mid C_i)$, use Bayes' rule to calculate the posterior densities, and define discriminant functions in terms of the posterior, for example

$$g_i(x) = \log P(C_i \mid x),$$

then this is called likelihood-based classification.
Now we are going to discuss discriminant-based classification, where we assume a model directly for the discriminant, bypassing the estimation of likelihoods or posteriors. The discriminant-based approach makes no assumption about the densities.
We define a model for the discriminant

$$g_i(x \mid \Phi_i),$$

explicitly parameterized with the set of parameters $\Phi_i$, as opposed to a likelihood-based scheme whose parameters are implicit in the likelihood densities.
In the discriminant-based approach, we do not care about correctly estimating the densities inside the class regions; all we care about is correctly estimating the boundaries between the class regions. This works well only when the discriminant can be approximated by a simple function.
We now define a linear discriminant function:

$$g_i(x \mid w_i, w_{i0}) = w_i^T x + w_{i0} = \sum_{j=1}^{d} w_{ij} x_j + w_{i0},$$

or, collecting all $K$ discriminants, a vector of discriminant functions

$$g(x) = W x + w_0,$$

where row $i$ of $W$ is $w_i^T$.
The linear discriminant is used frequently mainly due to its simplicity: both space and time complexity are $O(d)$. The model is also explainable, for the magnitude of the weight $w_{ij}$ shows the importance of the input $x_j$.
The linear discriminant should be tried first, before a more complicated model, to make sure the additional complexity is justified.
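As a minimal sketch of the scheme so far (the weights below are made up for illustration, not learned from data), a multiclass linear discriminant just computes $g_i(x) = w_i^T x + w_{i0}$ for each class and picks the argmax:

```python
import numpy as np

# Illustrative (hand-picked) parameters for K = 3 classes, d = 2 inputs.
W = np.array([[ 1.0, -1.0],   # w_1
              [-1.0,  1.0],   # w_2
              [ 0.5,  0.5]])  # w_3
w0 = np.array([0.0, 0.0, -1.0])

def predict(x):
    """Choose the class with the highest discriminant value."""
    g = W @ x + w0          # one discriminant value per class
    return int(np.argmax(g))

print(predict(np.array([2.0, 0.0])))  # g = [2, -2, 0], so class 0
```

Both the matrix-vector product and the argmax are linear in the number of parameters, which is the $O(d)$ cost per class mentioned above.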
Generalizing the linear model
If we start by adding a quadratic term,

$$g_i(x) = x^T W_i x + w_i^T x + w_{i0},$$

the model is more general, but it requires much larger training sets and may overfit on small samples.
An equivalent way is to preprocess the input by adding higher-order terms, also called product terms. A linearly inseparable problem can become linearly separable in this new space. Our discriminant becomes

$$g_i(x) = \sum_{j=1}^{k} w_{ij} \phi_{ij}(x),$$

where $\phi_{ij}(x)$ are called basis functions. Some examples are $\sin(x_1)$, $\exp(-(x_1 - m)^2 / c)$, $\log x_2$, and $\mathbb{1}(x_1 > c)$.
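A toy sketch of this idea (labels and weights chosen by hand): the XOR pattern is not linearly separable in $(x_1, x_2)$, but adding the single product term $x_1 x_2$ as a basis function makes it separable:

```python
import numpy as np

# XOR-like labels: not linearly separable in the original 2-D space.
X = np.array([[-1, -1], [-1, 1], [1, -1], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def expand(x):
    """phi(x) = (x1, x2, x1*x2): original inputs plus one product term."""
    return np.array([x[0], x[1], x[0] * x[1]])

# In the expanded space, a hand-picked linear discriminant that only
# looks at the product term separates the classes perfectly.
w = np.array([0.0, 0.0, -1.0])
pred = np.array([int(w @ expand(x) > 0) for x in X])
print(pred)  # matches y
```

The classifier is still linear; only the input representation changed.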
Geometry of the Linear Discriminant
We start with a two-class classifier, in which case one discriminant function is sufficient:

$$g(x) = w^T x + w_0,$$

where $w$ is the weight vector and $w_0$ is the threshold: we choose $C_1$ if $g(x) > 0$ and $C_2$ otherwise. The hyperplane $g(x) = 0$ divides the input space into two half-spaces. Taking two points $x_1$ and $x_2$ on the decision surface, $g(x_1) = g(x_2) = 0$, so

$$w^T (x_1 - x_2) = 0.$$

This means $w$ is perpendicular to any vector lying on the hyperplane, and hence is the normal vector of the hyperplane itself. This allows us to unambiguously write any $x$ as

$$x = x_p + r \frac{w}{\|w\|},$$

where $x_p$ is the projection of $x$ on the hyperplane, with $g(x_p) = 0$, and the factor $\frac{w}{\|w\|}$ on $r$ is a unit vector, to make sure $r$ has the physical meaning of the signed distance from $x$ to the hyperplane. Substituting this into $g$ and using $g(x_p) = 0$, we can solve

$$r = \frac{g(x)}{\|w\|}.$$
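This geometry is easy to verify numerically. A small sketch with illustrative weights ($\|w\| = 5$): the signed distance is $g(x)/\|w\|$, and subtracting $r\,w/\|w\|$ from $x$ lands back on the hyperplane:

```python
import numpy as np

w = np.array([3.0, 4.0])   # illustrative weight vector, ||w|| = 5
w0 = -5.0

def signed_distance(x):
    """r = g(x) / ||w||: signed distance from x to the hyperplane g = 0."""
    return (w @ x + w0) / np.linalg.norm(w)

x = np.array([3.0, 4.0])           # g(x) = 9 + 16 - 5 = 20, so r = 4
print(signed_distance(x))          # 4.0

# Projecting x onto the hyperplane recovers x_p with g(x_p) = 0:
x_p = x - signed_distance(x) * w / np.linalg.norm(w)
print(w @ x_p + w0)                # ~0
```

The sign of $r$ tells which side of the hyperplane $x$ lies on, which is exactly the two-class decision.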
In the multiclass case, the usual approach is to assign $x$ to the class having the highest discriminant: choose $C_i$ if $g_i(x) = \max_k g_k(x)$.
Logistic regression and softmax
If the class densities are Gaussian and share a common covariance matrix $\Sigma$, the discriminant function is linear:

$$g_i(x) = w_i^T x + w_{i0},$$

where the parameters can be analytically calculated as

$$w_i = \Sigma^{-1} \mu_i, \qquad w_{i0} = -\frac{1}{2} \mu_i^T \Sigma^{-1} \mu_i + \log P(C_i).$$
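These formulas translate directly into code. A sketch with illustrative estimates (two classes, identity shared covariance, equal priors), under which the rule reduces to nearest-mean classification:

```python
import numpy as np

m = np.array([[0.0, 0.0], [2.0, 2.0]])   # estimated class means
S = np.eye(2)                            # shared covariance (identity here)
priors = np.array([0.5, 0.5])

# w_i = S^{-1} m_i,  w_i0 = -0.5 * m_i^T S^{-1} m_i + log P(C_i)
S_inv = np.linalg.inv(S)
W = m @ S_inv.T                          # row i is w_i
w0 = np.array([-0.5 * mi @ S_inv @ mi + np.log(p)
               for mi, p in zip(m, priors)])

def predict(x):
    return int(np.argmax(W @ x + w0))

print(predict(np.array([0.1, 0.1])), predict(np.array([1.9, 1.9])))
```

Note that the quadratic term $x^T \Sigma^{-1} x$ cancels only because the covariance is shared; with per-class covariances the discriminant would be quadratic.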
In the case of two classes, we define $y \equiv P(C_1 \mid x)$, so that $P(C_2 \mid x) = 1 - y$. We now apply the logit transformation to $y$:

$$\operatorname{logit}(y) = \log \frac{y}{1 - y} = \log \frac{P(C_1 \mid x)}{P(C_2 \mid x)}.$$

It is the difference between the log posteriors of the two classes. In our case (Gaussian classes with a shared covariance), the logit is linear:

$$\operatorname{logit}(P(C_1 \mid x)) = w^T x + w_0.$$
This means we can use a linear model $w^T x + w_0$ to predict the logit of $y$. The inverse of the logit function is the sigmoid (logistic) function:

$$y = \operatorname{sigmoid}(w^T x + w_0) = \frac{1}{1 + \exp\!\left(-(w^T x + w_0)\right)}.$$
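The inverse relationship is worth checking directly. A minimal sketch: composing sigmoid with logit recovers the original probability:

```python
import math

def logit(y):
    """log-odds: log(y / (1 - y)), defined for 0 < y < 1."""
    return math.log(y / (1.0 - y))

def sigmoid(a):
    """Inverse of logit: maps any real a back to (0, 1)."""
    return 1.0 / (1.0 + math.exp(-a))

print(sigmoid(logit(0.8)))   # recovers 0.8 up to float rounding
```

So a linear model that outputs the logit, followed by a sigmoid, yields a valid posterior probability in $(0, 1)$.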
Now we generalize this to the multiclass case. Using the previous Gaussian assumption, we get

$$\log P(C_i \mid x) = \log p(x \mid C_i) + \log P(C_i) - \log p(x) = w_i^T x + w_{i0} + c$$

for some constant $c = -\log p(x)$ that is the same for all classes. Exponentiating and normalizing so the posteriors sum to one, the constant cancels and we obtain

$$P(C_i \mid x) = \frac{\exp(w_i^T x + w_{i0})}{\sum_{k=1}^{K} \exp(w_k^T x + w_{k0})}.$$

This is called the softmax function.
Note: we do not need to compute and plug in the constant when we evaluate this function, because the softmax function is translation invariant: adding the same constant to every input leaves the output unchanged. In fact, in most cases we do not even need to evaluate the posterior, since the argmax of the discriminants is enough. When we do need it, however, we must exponentiate potentially large numbers, which causes numerical issues (overflow). To avoid this, we can shift all the values down by the maximum among the $g_i$; then every exponent is at most zero, and we only deal with $\exp$ values between $0$ and $1$.
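The max-shift trick above can be sketched in a few lines. With inputs around 1000, a naive `exp` would overflow, but shifting by the maximum keeps every exponent non-positive:

```python
import numpy as np

def softmax(g):
    """Numerically stable softmax: exploit translation invariance."""
    z = g - np.max(g)    # shift so the largest exponent is exactly 0
    e = np.exp(z)        # all values now lie in (0, 1]
    return e / e.sum()

g = np.array([1000.0, 1001.0, 1002.0])   # naive exp(1000) would overflow
p = softmax(g)
print(p, p.sum())                         # finite probabilities summing to 1
```

By translation invariance, `softmax(g)` equals `softmax(g - c)` for any constant `c`, so the shift changes nothing but the numerics.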